diff --git a/12-spatial-cv.Rmd b/12-spatial-cv.Rmd index a9d3a9336..36467e304 100644 --- a/12-spatial-cv.Rmd +++ b/12-spatial-cv.Rmd @@ -11,12 +11,15 @@ Packages **GGally**, **lgr**, **kernlab**, **ml3measures**, **paradox**, **pROC* ```{r 12-spatial-cv-1, message=FALSE} library(dplyr) +library(future) +library(lgr) library(mlr3) library(mlr3learners) library(mlr3extralearners) library(mlr3spatiotempcv) library(mlr3tuning) library(mlr3viz) +library(progressr) library(sf) library(terra) ``` @@ -83,7 +86,7 @@ The landslide initiation point is located in the scarp of a landslide polygon. S There are 175 landslide and 175 non-landslide points, as shown by `summary(lsl$lslpts)`. The 175 non-landslide points were sampled randomly from the study area, with the restriction that they must fall outside a small buffer around the landslide polygons. -```{r lsl-map, echo=FALSE, fig.cap="Landslide initiation points (red) and points unaffected by landsliding (blue) in Southern Ecuador.", fig.scap="Landslide initiation points."} +```{r lsl-map, echo=FALSE, out.width="70%", fig.cap="Landslide initiation points (red) and points unaffected by landsliding (blue) in Southern Ecuador.", fig.scap="Landslide initiation points."} # library(tmap) # data("lsl", package = "spDataLarge") # ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge")) @@ -134,15 +137,7 @@ Since terrain attributes are frequently associated with landsliding [@muenchow_g - `elev`: elevation (m a.s.l.) as the representation of different altitudinal zones of vegetation and precipitation in the study area. - `log10_carea`: the decadic logarithm of the catchment area (log10 m^2^) representing the amount of water flowing towards a location. -Data containing the landslide points, with the corresponding terrain attributes, is provided in the **spDataLarge** package, along with the multilayer terrain attribute raster from which the values were extracted. -Hence, if you have not computed the predictors yourself, attach the corresponding data before running the code of the remaining chapter: - -```{r 12-spatial-cv-5} -# attach landslide points with terrain attributes -data("lsl", package = "spDataLarge") -# attach terrain attribute raster stack -ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge")) -``` +It might be a worthwhile exercise to compute the terrain attributes with the help of R-GIS bridges (see Chapter \@ref(gis)) and extract them to the landslide points (see Exercise section at the end of this Chapter). ## Conventional modeling approach in R {#conventional-model} @@ -191,7 +186,7 @@ In addition to a model object (`fit`), this function also expects a `SpatRaster` pred = terra::predict(ta, model = fit, type = "response") ``` -```{r lsl-susc, echo=FALSE, fig.cap="Spatial prediction of landslide susceptibility using a GLM.", fig.scap = "Spatial prediction of landslide susceptibility.", warning=FALSE} +```{r lsl-susc, echo=FALSE, out.width="70%",fig.cap="Spatial prediction of landslide susceptibility using a GLM.", fig.scap = "Spatial prediction of landslide susceptibility.", warning=FALSE} # # attach study mask for the natural part of the study area # data("lsl", "study_mask", package = "spDataLarge") # ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge")) @@ -300,11 +295,11 @@ There are dozens of packages for statistical learning\index{statistical learning Getting acquainted with each of these packages, including how to undertake cross-validation and hyperparameter\index{hyperparameter} tuning, can be a time-consuming process. Comparing model results from different packages can be even more laborious. The **mlr3** package and ecosystem was developed to address these issues. -It acts as a 'meta-package', providing a unified interface to popular supervised and unsupervised statistical learning techniques including classification, regression\index{regression}, survival analysis and clustering\index{clustering} [@lang_mlr3_2019]. +It acts as a 'meta-package', providing a unified interface to popular supervised and unsupervised statistical learning techniques including classification, regression\index{regression}, survival analysis and clustering\index{clustering} [@lang_mlr3_2019; @becker_mlr3_2022]. The standardized **mlr3** interface is based on eight 'building blocks'. As illustrated in Figure \@ref(fig:building-blocks), these have a clear order. -```{r building-blocks, echo=FALSE, fig.height=4, fig.width=4, fig.cap="Basic building blocks of the mlr3 package. Source: @becker_mlr3_2021 (Permission to reuse this figure was kindly granted.)", fig.scap="Basic building blocks of the mlr3 package."} +```{r building-blocks, echo=FALSE, fig.height=4, fig.width=4, fig.cap="Basic building blocks of the mlr3 package. Source: @becker_mlr3_2022. (Permission to reuse this figure was kindly granted.)", fig.scap="Basic building blocks of the mlr3 package."} knitr::include_graphics("figures/13_ml_abstraction_crop.png") ``` @@ -316,7 +311,7 @@ Third, the **resampling** approach assesses the predictive performance of the mo ### Generalized linear model {#glm} To implement a GLM\index{GLM} in **mlr3**\index{mlr3 (package)}, we must create a **task** containing the landslide data. -Since the response is binary (two-category variable) and has a spatial dimension, we create a classification\index{classification} task with `mlr3spatiotempcv::TaskClassifST$new()` (for non-spatial tasks, use `TaskClassif$new()` or `TaskRegr$new()` for regression\index{regression} tasks, see `?Task` for other task types).^[The **mlr3** ecosystem makes heavily use of **data.table** and **R6** classes. And though you might use **mlr3** without knowing the specifics of **data.table** or **R6**, it might be rather helpful. To learn more about **data.table**, please refer to https://rdatatable.gitlab.io/data.table/index.html. To learn more about **R6**, we recommend [Chapter 14](https://adv-r.hadley.nz/fp.html) of the Advanced R book [@wickham_advanced_2019].] +Since the response is binary (two-category variable) and has a spatial dimension, we create a classification\index{classification} task with `TaskClassifST$new()` of the **mlr3spatiotempcv** package [@schratz_mlr3spatiotempcv_2021, for non-spatial tasks, use `mlr3::TaskClassif$new()` or `mlr3::TaskRegr$new()` for regression\index{regression} tasks, see `?Task` for other task types].^[The **mlr3** ecosystem makes heavily use of **data.table** and **R6** classes. And though you might use **mlr3** without knowing the specifics of **data.table** or **R6**, it might be rather helpful. To learn more about **data.table**, please refer to https://rdatatable.gitlab.io/data.table/index.html. To learn more about **R6**, we recommend [Chapter 14](https://adv-r.hadley.nz/fp.html) of the Advanced R book [@wickham_advanced_2019].] The first essential argument of these `Task*$new()` functions is `backend`. `backend` expects the data to be used for the modeling including the response and predictor variables. The `target` argument indicates the name of a response variable (in our case this is `lslpts`) and `positive` determines which of the two factor levels of the response variable indicate the landslide initiation point (in our case this is `TRUE`). @@ -380,7 +375,7 @@ lrns_df = mlr3extralearners::list_mlr3learners( # "FNN", "kernlab", "RWeka")), row.names = c(NA, 6L), class = "data.frame") knitr::kable(lrns_df, caption = paste("Sample of available learners for binomial", - "tasks in the mlr package."), + "tasks in the mlr3 package."), caption.short = "Sample of available learners.", booktabs = TRUE) ``` @@ -423,7 +418,7 @@ identical(fit$coefficients, learner$model$coefficients) The set-up steps for modeling with **mlr3**\index{mlr3 (package)} may seem tedious. But remember, this single interface provides access to the 130+ learners shown by `mlr3extralearners::list_mlr3learners()`; it would be far more tedious to learn the interface for each learner! Further advantages are simple parallelization of resampling techniques and the ability to tune machine learning hyperparameters\index{hyperparameter} (see Section \@ref(svm)). -Most importantly, (spatial) resampling in **mlr3spatiotempcv** is straightforward, requiring only two more steps: specifying a resampling method and running it. +Most importantly, (spatial) resampling in **mlr3spatiotempcv** [@schratz_mlr3spatiotempcv_2021] is straightforward, requiring only two more steps: specifying a resampling method and running it. We will use a 100-repeated 5-fold spatial CV\index{cross-validation!spatial CV}: five partitions will be chosen based on the provided coordinates in our `task` and the partitioning will be repeated 100 times:[^13] [^13]: @@ -433,55 +428,57 @@ We will use a 100-repeated 5-fold spatial CV\index{cross-validation!spatial CV}: ```{r 12-spatial-cv-18, eval=FALSE} -perf_level = mlr3::rsmp("repeated_spcv_coords", folds = 5, repeats = 100) +resampling = mlr3::rsmp("repeated_spcv_coords", folds = 5, repeats = 100) ``` To execute the spatial resampling, we run `resample()` using the previously specified task, learner, and resampling strategy. -This takes some time (around 15 seconds on a modern laptop) because it computes the performance measure for 500 models. -Setting a seed ensures the reproducibility\index{reproducibility} of the obtained result and will ensure the same spatial partitioning when re-running the code. +This takes some time (around 15 seconds on a modern laptop) because it computes 500 resampling partitions and 500 models. +As performance measure, we again choose the AUROC. +To retrieve it, we use the `score()` method of the resampling result output object (`score_spcv_glm`). +This returns a `data.table` object with 500 rows -- one for each model. ```{r 12-spatial-cv-19, eval=FALSE} -set.seed(012348) # reduce verbosity lgr::get_logger("mlr3")$set_threshold("warn") -# run spatial cross-validation -sp_cv = mlr3::resample(task = task, - learner = learner, - resampling = perf_level) +# run spatial cross-validation and save it to resample result glm (rr_glm) +rr_spcv_glm = mlr3::resample(task = task, + learner = learner, + resampling = resampling) +# compute the AUROC as a data.table +score_spcv_glm = rr_glm$score(measure = mlr3::msr("classif.auc")) %>% + # keep only the columns you need + .[, .(task_id, learner_id, resampling_id, classif.auc)] ``` -The output of the preceding code chunk is a bias-reduced assessment of the model's predictive performance, as illustrated in the following code chunk (required input data is saved in the file `12-sp_cv.rds` in the book's GitHub repo). -As performance measure, we again choose the AUROC. +The output of the preceding code chunk is a bias-reduced assessment of the model's predictive performance. +We have saved it as `extdata/12-bmr_score.rds` in the book's GitHub repo. +If required, you can read it in as follows: -```{r 12-spatial-cv-21, echo=FALSE, eval=FALSE} -# bm_agg = readRDS("extdata/12-sp_conv_cv.rds") -# sp_cv = bm_agg$resample_result[[1]] -score = readRDS("extdata/12-glm_score_sp_nsp.rds") +```{r 12-spatial-cv-21} +score = readRDS("extdata/12-bmr_score.rds") +score_spcv_glm = score[learner_id == "classif.log_reg" & + resampling_id == "repeated_spcv_coords"] ``` -```{r 12-spatial-cv-22, eval=FALSE} -score[, .(mean_auc = mean(classif.auc)), by = resampling_id] -# summary statistics of the 500 models -# sp_cv$score(measures = mlr3::msr("classif.auc"))[, summary(classif.auc)] -# mean AUROC of the 500 models -# sp_cv$aggregate(measures = mlr3::msr("classif.auc")) +To compute the mean AUROC over all 500 models, we run: + +```{r 12-spatial-cv-22} +mean(score_spcv_glm$classif.auc) |> + round(2) ``` To put these results in perspective, let us compare them with AUROC\index{AUROC} values from a 100-repeated 5-fold non-spatial cross-validation (Figure \@ref(fig:boxplot-cv); the code for the non-spatial cross-validation\index{cross-validation} is not shown here but will be explored in the exercise section). As expected, the spatially cross-validated result yields lower AUROC values on average than the conventional cross-validation approach, underlining the over-optimistic predictive performance due to spatial autocorrelation\index{autocorrelation!spatial} of the latter. -```{r boxplot-cv, echo=FALSE, fig.width=6, fig.height=9, fig.cap="Boxplot showing the difference in AUROC values between spatial and conventional 100-repeated 5-fold cross-validation.", fig.scap="Boxplot showing AUROC values.", eval=FALSE} +```{r boxplot-cv, echo=FALSE, out.width="75%", fig.cap="Boxplot showing the difference in GLM AUROC values on spatial and conventional 100-repeated 5-fold cross-validation.", fig.scap="Boxplot showing AUROC values."} library(ggplot2) -# extract the AUROC values and put them into one data.table -# d = purrr::map_dfr(bm_agg$resample_result, ~ .$score(mlr3::msr("classif.auc"))) -d = score # rename the levels of resampling_id -d[, resampling_id := as.factor(resampling_id) |> - forcats::fct_recode("conventional CV" = "repeated_cv", - "spatial CV" = "repeated_spcv_coords") |> - forcats::fct_rev()] +score[, resampling_id := as.factor(resampling_id) |> + forcats::fct_recode("conventional CV" = "repeated_cv", + "spatial CV" = "repeated_spcv_coords") |> + forcats::fct_rev()] # create the boxplot -ggplot2::ggplot(data = d, +ggplot2::ggplot(data = score[learner_id == "classif.log_reg"], mapping = ggplot2::aes(x = resampling_id, y = classif.auc)) + ggplot2::geom_boxplot(fill = c("lightblue2", "mistyrose2")) + ggplot2::theme_bw() + @@ -526,9 +523,11 @@ mlr3extralearners::list_mlr3learners() %>% Of the options illustrated above, we will use `ksvm()` from the **kernlab** package [@karatzoglou_kernlab_2004]. To allow for non-linear relationships, we use the popular radial basis function (or Gaussian) kernel which is also the default of `ksvm()`. +To make sure that the tuning does not stop because of one failing model, we additionally define a fallback learner (for more information please refer to https://mlr3book.mlr-org.com/technical.html#fallback-learners). -```{r 12-spatial-cv-24, eval=FALSE} +```{r 12-spatial-cv-24} lrn_ksvm = mlr3::lrn("classif.ksvm", predict_type = "prob", kernel = "rbfdot") +lrn_ksvm$fallback = lrn("classif.featureless", predict_type = "prob") ``` The next stage is to specify a resampling strategy. @@ -537,12 +536,12 @@ Again we will use a 100-repeated 5-fold spatial CV\index{cross-validation!spatia -```{r 12-spatial-cv-25, eval=FALSE} +```{r 12-spatial-cv-25} # performance estimation level perf_level = mlr3::rsmp("repeated_spcv_coords", folds = 5, repeats = 100) ``` -Note that this is the exact same code as used for the GLM\index{GLM} in Section \@ref(glm); we have simply repeated it here as a reminder. +Note that this is the exact same code as used for the resampling for the GLM\index{GLM} in Section \@ref(glm); we have simply repeated it here as a reminder. So far, the process has been identical to that described in Section \@ref(glm). The next step is new, however: to tune the hyperparameters\index{hyperparameter}. @@ -554,8 +553,8 @@ knitr::include_graphics("figures/13_cv.png") ``` This means that we split each fold again into five spatially disjoint subfolds which are used to determine the optimal hyperparameters\index{hyperparameter} (`tune_level` object in the code chunk below; see Figure \@ref(fig:inner-outer) for a visual representation). -To find the optimal hyperparameter combination, we fit 50 models (`terminator` parameter in the code chunk below) in each of these subfolds with randomly selected values for the hyperparameters C and Sigma. -The random selection of values C and Sigma is additionally restricted to a predefined tuning space (`ps` object). +To find the optimal hyperparameter combination, we fit 50 models (`terminator` object in the code chunk below) in each of these subfolds with randomly selected values for the hyperparameters C and Sigma. +The random selection of values C and Sigma is additionally restricted to a predefined tuning space (`search_space` object). The range of the tuning space was chosen with values recommended in the literature [@schratz_hyperparameter_2019]. - So far spatial CV\index{cross-validation!spatial CV} has been used to assess the ability of learning algorithms to generalize to unseen data. For spatial predictions, one would tune the hyperparameters\index{hyperparameter} on the complete dataset. This will be covered in Chapter \@ref(eco). @@ -716,7 +698,7 @@ Machine learning algorithms often require hyperparameter\index{hyperparameter} i Machine learning overall, and its use to understand spatial data, is a large field and this chapter has provided the basics, but there is more to learn. We recommend the following resources in this direction: -- The **mlr3 book** [@becker_mlr3_2021; https://mlr-org.github.io/mlr-tutorial/release/html/)] and especially the [chapter on the handling of spatio-temporal data](https://mlr3book.mlr-org.com/spatiotemporal.html). +- The **mlr3 book** [@becker_mlr3_2022; https://mlr-org.github.io/mlr-tutorial/release/html/)] and especially the [chapter on the handling of spatio-temporal data](https://mlr3book.mlr-org.com/spatiotemporal.html). - An academic paper on hyperparameter\index{hyperparameter} tuning [@schratz_hyperparameter_2019] - An academic paper on how to use **mlr3spatiotempcv** [@schratz_mlr3spatiotempcv_2021] - In case of spatio-temporal data, one should account for spatial\index{autocorrelation!spatial} and temporal\index{autocorrelation!temporal} autocorrelation when doing CV\index{cross-validation} [@meyer_improving_2018] diff --git a/15-eco.Rmd b/15-eco.Rmd index 817a3beb8..4c9991ba0 100644 --- a/15-eco.Rmd +++ b/15-eco.Rmd @@ -7,7 +7,7 @@ In it you will also make use of R's\index{R} interfaces to dedicated GIS\index{G The chapter uses the following packages: -```{r 15-eco-1, message=FALSE, eval=FALSE} +```{r 15-eco-1, message=FALSE} library(data.table) library(dplyr) library(mlr3) @@ -70,7 +70,7 @@ To guarantee an optimal prediction, it is advisable to tune beforehand the hyper All the data needed for the subsequent analyses is available via the **spDataLarge** package. -```{r 15-eco-2, eval=FALSE} +```{r 15-eco-2} # spatial vector objects data("study_area", "random_points", "comm", package = "spDataLarge") # spatial raster objects @@ -229,13 +229,13 @@ ep$carea = log10(ep$carea) As a convenience to the reader, we have added `ep` to **spDataLarge**: -```{r 15-eco-9, eval=FALSE} +```{r 15-eco-9, cache.lazy=FALSE} ep = terra::rast(system.file("raster/ep.tif", package = "spDataLarge")) ``` Finally, we can extract the terrain attributes to our field observations (see also Section \@ref(raster-extraction)). -```{r 15-eco-10, eval=FALSE} +```{r 15-eco-10, cache=TRUE, cache.lazy=FALSE, message=FALSE, warning=FALSE} random_points[, names(ep)] = # terra::extract adds automatically a for our purposes unnecessary ID column terra::extract(ep, terra::vect(random_points)) |> @@ -272,7 +272,7 @@ Often ordinations\index{ordination} using presence-absence data yield better res Ordination techniques such as NMDS\index{NMDS} require at least one observation per site. Hence, we need to dismiss all sites in which no species were found. -```{r 15-eco-11, eval=FALSE} +```{r 15-eco-11} # presence-absence matrix pa = vegan::decostand(comm, "pa") # 100 rows (sites), 69 columns (species) # keep only sites in which at least one species was found @@ -303,7 +303,7 @@ nmds$stress saveRDS(nmds, "extdata/15-nmds.rds") ``` -```{r 15-eco-14, include=FALSE, eval=FALSE} +```{r 15-eco-14, include=FALSE} nmds = readRDS("extdata/15-nmds.rds") ``` @@ -326,7 +326,13 @@ plot(y = sc[, 1], x = elev, xlab = "elevation in m", ylab = "First NMDS axis", cex.lab = 0.8, cex.axis = 0.8) ``` -```{r xy-nmds, fig.cap="Plotting the first NMDS axis against altitude.", fig.scap = "First NMDS axis against altitude plot.", fig.asp=1, out.width="60%"} +```{r xy-nmds, fig.cap="Plotting the first NMDS axis against altitude.", fig.scap = "First NMDS axis against altitude plot.", fig.asp=1, out.width="60%", message=FALSE, echo=FALSE} +elev = dplyr::filter(random_points, id %in% rownames(pa)) |> + dplyr::pull(dem) +# rotating NMDS in accordance with altitude (proxy for humidity) +rotnmds = vegan::MDSrotate(nmds, elev) +# extracting the first two axes +sc = vegan::scores(rotnmds, choices = 1:2) knitr::include_graphics("figures/15_xy_nmds.png") ``` @@ -393,7 +399,7 @@ We refer the reader to @james_introduction_2013 for a more detailed description To introduce decision trees by example, we first construct a response-predictor matrix by joining the rotated NMDS\index{NMDS} scores to the field observations (`random_points`). We will also use the resulting data frame for the **mlr3**\index{mlr3 (package)} modeling later on. -```{r 15-eco-16, eval=FALSE} +```{r 15-eco-16, message=FALSE} # construct response-predictor matrix # id- and response variable rp = data.frame(id = as.numeric(rownames(sc)), sc = sc[, 1]) @@ -478,7 +484,7 @@ Having already constructed the input variables (`rp`), we are all set for specif For specifying a spatial task, we use again the **mlr3spatiotempcv** package [@schratz_mlr3spatiotempcv_2021 & Section \@ref(spatial-cv-with-mlr3)]. Since our response (`sc`) is numeric, we use a regression\index{regression} task. -```{r 15-eco-20, eval=FALSE} +```{r 15-eco-20} # create task task = mlr3spatiotempcv::TaskRegrST$new( id = "mongon", backend = dplyr::select(rp, -id, -spri), target = "sc") @@ -488,7 +494,7 @@ Using an `sf` object as the backend automatically provides the geometry informat Additionally, we got rid of the columns `id` and `spri` since these variables should not be used as predictors in the modeling. Next, we go on to construct the a random forest\index{random forest} learner from the **ranger** package. -```{r 15-eco-21, eval=FALSE} +```{r 15-eco-21} lrn_rf = lrn("regr.ranger", predict_type = "response") ``` @@ -507,7 +513,7 @@ Hyperparameter\index{hyperparameter} combinations will be selected randomly but (4), `sample.fraction` should range between 0.2 and 0.9 and `min.node.size` should range between 1 and 10. -```{r 14-eco-23, eval=FALSE} +```{r 15-eco-23} # specifying the search space search_space = paradox::ps( mtry = paradox::p_int(lower = 1, upper = ncol(task$data()) - 1), @@ -517,12 +523,12 @@ search_space = paradox::ps( ``` Having defined the search space, we are all set for specifying our tuning via the `AutoTuner()` function. -Since we deal with geographic data, we will again make use of spatial cross-validation to tune the hyperparameters\index{hyperparameter} (see Sections \@ref(intro-cv) and \@ref(spatial-cv-with-mlr)). +Since we deal with geographic data, we will again make use of spatial cross-validation to tune the hyperparameters\index{hyperparameter} (see Sections \@ref(intro-cv) and \@ref(spatial-cv-with-mlr3)). Specifically, we will use a five-fold spatial partitioning with only one repetition (`rsmp()`). -In each of these spatial partitions, we run 50 models (`trm()`) while using randomly selected hyperparameter configurations (`tnr()`) within predefined limits (`seach_space`) to find the optimal hyperparameter\index{hyperparameter} combination [see also Section \@ref(svm) and https://mlr3book.mlr-org.com/optimization.html#autotuner, @becker_mlr3_2021]. +In each of these spatial partitions, we run 50 models (`trm()`) while using randomly selected hyperparameter configurations (`tnr()`) within predefined limits (`seach_space`) to find the optimal hyperparameter\index{hyperparameter} combination [see also Section \@ref(svm) and https://mlr3book.mlr-org.com/optimization.html#autotuner, @becker_mlr3_2022]. The performance measure is the root mean squared error (RMSE\index{RMSE}). -```{r 15-eco-22, eval=FALSE} +```{r 15-eco-22} autotuner_rf = mlr3tuning::AutoTuner$new( learner = lrn_rf, # spatial partitioning @@ -540,7 +546,7 @@ autotuner_rf = mlr3tuning::AutoTuner$new( Calling the `train()`-method of the `AutoTuner`-object finally runs the hyperparameter\index{hyperparameter} tuning, and will find the optimal hyperparameter\index{hyperparameter} combination for the specified parameters. -```{r 14-eco-24, eval=FALSE} +```{r 15-eco-24, eval=FALSE} # hyperparameter tuning set.seed(0412022) autotuner_rf$train(task) @@ -561,25 +567,21 @@ autotuner_rf$train(task) saveRDS(autotuner_rf, "extdata/15-tune.rds") ``` -```{r 15-eco-26, echo=FALSE, eval=FALSE} +```{r 15-eco-26, echo=FALSE, eval=TRUE, cache.lazy=FALSE} autotuner_rf = readRDS("extdata/15-tune.rds") ``` -An `mtry` of 4, a `sample.fraction` of 0.9, and a `min.node.size` of 7 represent the best hyperparameter\index{hyperparameter} combination. -An RMSE\index{RMSE} of - -0.38 +An `mtry` of `r autotuner_rf$tuning_result$mtry`, a `sample.fraction` of `r round(autotuner_rf$tuning_result$sample.fraction, 2)`, and a `min.node.size` of `r autotuner_rf$tuning_result$min.node.size` represent the best hyperparameter\index{hyperparameter} combination. +An RMSE\index{RMSE} of `r round(autotuner_rf$tuning_result$regr.rmse, 2)` is relatively good when considering the range of the response variable which is - -3.04 -(`diff(range(rp$sc))`). +`r round(diff(range(rp$sc)), 2)` (`diff(range(rp$sc))`). ### Predictive mapping The tuned hyperparameters\index{hyperparameter} can now be used for the prediction. To do so, we only need to run the `predict` method of our fitted `AutoTuner` object. -```{r 15-eco-27, eval=FALSE} +```{r 15-eco-27, eval=TRUE, cache.lazy=FALSE} # predicting using the best hyperparameter combination autotuner_rf$predict(task) ``` @@ -587,7 +589,7 @@ autotuner_rf$predict(task) The `predict` method will apply the model to all observations used in the modeling. Given a multilayer `SpatRaster` containing rasters named as the predictors used in the modeling, `terra::predict()` will also make spatial predictions, i.e., predict to new data. -```{r 15-eco-28, eval=FALSE} +```{r 15-eco-28, eval=TRUE, cache.lazy=FALSE} pred = terra::predict(ep, model = autotuner_rf, fun = predict) ``` @@ -627,7 +629,7 @@ knitr::include_graphics("figures/15_rf_pred.png") In case, `terra::predict()` does not support a model algorithm, you can still make the predictions manually. -```{r 15-eco-29, eval=FALSE} +```{r 15-eco-29, cache.lazy=FALSE} newdata = as.data.frame(as.matrix(ep)) colSums(is.na(newdata)) # 0 NAs # but assuming there were 0s results in a more generic approach @@ -637,7 +639,8 @@ newdata[ind, "pred"] = data.table::as.data.table(tmp)[["response"]] pred_2 = ep$dem # now fill the raster with the predicted values pred_2[] = newdata$pred -identical(values(pred), values(pred_2)) # TRUE +# check if terra and our manual prediction is the same +all(values(pred - pred_2) == 0) ``` The predictive mapping clearly reveals distinct vegetation belts (Figure \@ref(fig:rf-pred)). diff --git a/_12-ex.Rmd b/_12-ex.Rmd index f7c05238a..f6b938c2f 100644 --- a/_12-ex.Rmd +++ b/_12-ex.Rmd @@ -1,4 +1,6 @@ +```{asis, message=FALSE} The solutions assume the following packages are attached (other packages will be attached when needed): +``` ```{r 12-ex-e0, message=FALSE, warning=FALSE} library(dplyr) @@ -16,6 +18,7 @@ library(tmap) ``` E1. Compute the following terrain attributes from the `elev` dataset loaded with `terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))$elev` with the help of R-GIS bridges (see this [Chapter](https://geocompr.robinlovelace.net/gis.html#gis)): + - Slope - Plan curvature - Profile curvature @@ -116,9 +119,11 @@ tm_shape(hs, bbox = bbx) + legend.title.size = 0.9) ``` -E4. Compute a 100-repeated 5-fold non-spatial cross-validation and spatial CV based on the GLM learner and compare the AUROC values from both resampling strategies with the help of boxplots (see this [Figure](https://geocompr.robinlovelace.net/spatial-cv.html#fig:boxplot-cv). +E4. Compute a 100-repeated 5-fold non-spatial cross-validation and spatial CV based on the GLM learner and compare the AUROC values from both resampling strategies with the help of boxplots (see this [Figure](https://geocompr.robinlovelace.net/spatial-cv.html#fig:boxplot-cv)). + Hint: You need to specify a non-spatial resampling strategy. -Another hint: You might want to Excercises 4 to 6 in one go with the help of `mlr3::benchmark()` and `mlr3::benchmark_grid()` (for more information, please refer to https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking). + +Another hint: You might want to solve Excercises 4 to 6 in one go with the help of `mlr3::benchmark()` and `mlr3::benchmark_grid()` (for more information, please refer to https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking). When doing so, keep in mind that the computation can take very long, probably several days. This, of course, depends on your system. Computation time will be shorter the more RAM and cores you have at your disposal. @@ -140,11 +145,14 @@ task = TaskClassifST$new( # construct learners (for all subsequent exercises) # GLM lrn_glm = lrn("classif.log_reg", predict_type = "prob") +lrn_glm$fallback = lrn("classif.featureless", predict_type = "prob") # SVM # construct SVM learner (using ksvm function from the kernlab package) lrn_ksvm = lrn("classif.ksvm", predict_type = "prob", kernel = "rbfdot", type = "C-svc") +lrn_ksvm$fallback = lrn("classif.featureless", predict_type = "prob") + # specify nested resampling and adjust learner accordingly # five spatially disjoint partitions tune_level = rsmp("spcv_coords", folds = 5) @@ -167,6 +175,7 @@ at_ksvm = AutoTuner$new( # QDA lrn_qda = lrn("classif.qda", predict_type = "prob") +lrn_qda$fallback = lrn("classif.featureless", predict_type = "prob") # SVM without tuning hyperparameters vals = lrn_ksvm$param_set$values @@ -194,7 +203,10 @@ library(future) future::plan(list("sequential", "multisession"), workers = floor(availableCores() / 2)) set.seed(021522) -bmr = benchmark(grid, store_backends = FALSE) +bmr = benchmark(grid, + store_backends = FALSE, + store_models = FALSE, + encapsulate = "evaluate") # stop parallelization future:::ClusterRegistry("stop") # save your result, e.g. to diff --git a/code/12-cv.R b/code/12-cv.R index fcc9f37ef..d30f90fad 100644 --- a/code/12-cv.R +++ b/code/12-cv.R @@ -60,6 +60,7 @@ task = TaskClassifST$new( # glm learner lrn_glm = lrn("classif.log_reg", predict_type = "prob") +# define a fallback learner in case one model fails lrn_glm$fallback = lrn("classif.featureless", predict_type = "prob") # construct SVM learner (using ksvm function from the kernlab package) @@ -82,7 +83,7 @@ tune_level = rsmp("spcv_coords", folds = 5) terminator = trm("evals", n_evals = 50) tuner = tnr("random_search") # define the outer limits of the randomly selected hyperparameters -ps = ps( +search_space = ps( C = p_dbl(lower = -12, upper = 15, trafo = function(x) 2^x), sigma = p_dbl(lower = -15, upper = 6, trafo = function(x) 2^x) ) @@ -91,11 +92,11 @@ at_ksvm = AutoTuner$new( learner = lrn_ksvm, resampling = tune_level, measure = msr("classif.auc"), - search_space = ps, + search_space = search_space, terminator = terminator, tuner = tuner ) -at_ksvm$fallback = lrn("classif.featureless", predict_type = "prob") + # 2.3 cross-validation==== #********************************************************** @@ -141,6 +142,8 @@ progressr::with_progress(expr = { tictoc::toc() # stop the parallelization plan future:::ClusterRegistry("stop") +# save your result +# saveRDS(bmr, file = "extdata/12-bmr_glm_svm_spcv_convcv.rds") # plot your result # library(mlr3viz) @@ -152,24 +155,26 @@ future:::ClusterRegistry("stop") # ggplot2::theme_bw() # instead of using autoplot, it might be easier to create the figure yourself -agg = bmr$aggregate(measures = msr("classif.auc")) -# # extract the AUROC values and put them into one data.table -# d = purrr::map_dfr(agg$resample_result, ~ .$score(msr("classif.auc"))) -# # rename the levels of resampling_id -# d[, resampling_id := as.factor(resampling_id) |> -# forcats::fct_recode("conventional CV" = "repeated_cv", -# "spatial CV" = "repeated_spcv_coords") |> -# forcats::fct_rev()] +score = bmr$score(measure = mlr3::msr("classif.auc")) %>% + # keep only the columns you need + .[, .(task_id, learner_id, resampling_id, classif.auc)] +# or read in the score +# score = readRDS("extdata/12-bmr_score.rds") + +# rename the levels of resampling_id +score[, resampling_id := as.factor(resampling_id) |> + forcats::fct_recode("conventional CV" = "repeated_cv", + "spatial CV" = "repeated_spcv_coords") |> + forcats::fct_rev()] # create the boxplot which shows the overfitting in the nsp-case -# ggplot2::ggplot(data = d, mapping = aes(x = resampling_id, y = classif.auc)) + -# geom_boxplot(fill = c("lightblue2", "mistyrose2")) + -# theme_bw() + -# labs(y = "AUROC", x = "") - -# save your result -saveRDS(agg, file = "/home/jannes.muenchow/rpackages/misc/geocompr/extdata/12-sp_conv_cv_test.rds") -# read in if necessary -# agg = readRDS("extdata/12-sp_conv_cv.rds") +ggplot2::ggplot( + data = score, + mapping = ggplot2::aes(x = interaction(resampling_id, learner_id), + y = classif.auc, fill = resampling_id + )) + + ggplot2::geom_boxplot() + + ggplot2::theme_bw() + + ggplot2::labs(y = "AUROC", x = "") #********************************************************** # 3 spatial prediction---- @@ -179,74 +184,58 @@ saveRDS(agg, file = "/home/jannes.muenchow/rpackages/misc/geocompr/extdata/12-sp # 3.1 make the prediction using the glm==== #********************************************************** -# lrn_glm$train(task) -# fit = lrn_glm$model -# # according to lrn_glm$help() the default for predictions was adjusted to FALSE, -# # since we would like to have TRUE predictions, we have to change back -# # fit$coefficients = fit$coefficients * -1 -# -# pred = terra::predict(object = ta, model = fit, fun = predict, -# type = "response") -# -# # make the prediction "manually" -# ta_2 = ta -# newdata = as.data.frame(as.matrix(ta_2)) -# colSums(is.na(newdata)) # ok, there are NAs -# ind = rowSums(is.na(newdata)) == 0 -# tmp = lrn_glm$predict_newdata(newdata = newdata[ind, ], task = task) -# newdata[ind, "pred"] = as.data.table(tmp)[["prob.TRUE"]] -# # check -# all.equal(as.numeric(values(pred)), newdata$pred) -# pred_2 = ta$slope -# values(pred_2) = newdata$pred -# all.equal(pred, pred_2) -# plot(c(pred, pred_2)) -# -# #********************************************************** -# # 3.2 plot the prediction==== -# #********************************************************** -# -# hs = terra::shade(ta$slope * pi / 180, -# terra::terrain(ta$elev, v = "aspect", unit = "radians"), -# 40, 270) -# plot(hs, col = gray(seq(0, 1, length.out = 100)), legend = FALSE) -# plot(pred, col = RColorBrewer::brewer.pal(name = "Reds", 9), add = TRUE, -# -# # or using tmap -# # white raster to only plot the axis ticks, otherwise gridlines would be visible -# study_mask = terra::vect(study_mask) -# lsl_sf = st_as_sf(lsl, coords = c("x", "y"), crs = 32717) -# rect = tmaptools::bb_poly(raster::raster(hs)) -# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.02, 1), ylim = c(-0.02, 1), -# relative = TRUE) -# tm_shape(hs, bbox = bbx) + -# tm_grid(col = "black", n.x = 1, n.y = 1, labels.inside.frame = FALSE, -# labels.rot = c(0, 90)) + -# tm_raster(palette = "white", legend.show = FALSE) + -# # hillshade -# tm_shape(terra::mask(hs, study_mask), bbox = bbx) + -# tm_raster(palette = gray(0:100 / 100), n = 100, -# legend.show = FALSE) + -# # prediction raster -# tm_shape(terra::mask(pred, study_mask)) + -# tm_raster(alpha = 0.5, palette = "Reds", n = 6, legend.show = TRUE, -# title = "Susceptibility") + -# # rectangle and outer margins -# qtm(rect, fill = NULL) + -# tm_layout(outer.margins = c(0.04, 0.04, 0.02, 0.02), frame = FALSE, -# legend.position = c("left", "bottom"), -# legend.title.size = 0.9) - -# # only GLM -design_grid = benchmark_grid( - tasks = task, - learners = lrn_glm, - resamplings = list(rsmp_sp, rsmp_nsp)) -bmr = benchmark(design = design_grid, store_backends = FALSE) -# bmr_dt = as.data.table(bmr) -# saveRDS(bmr, "extdata/12-glm_sp_nsp_bmr.rds") -# agg = bmr$aggregate(measures = mlr3::msr("classif.auc")) -# saveRDS(agg, "extdata/12-glm_sp_nsp.rds") -score = bmr$score(measures = msr("classif.auc")) -saveRDS(score[, .(task_id, resampling_id, learner_id, classif.auc)], - "extdata/12-glm_score_sp_nsp.rds") +lrn_glm$train(task) +fit = lrn_glm$model +# according to lrn_glm$help() the default for predictions was adjusted to FALSE, +# since we would like to have TRUE predictions, we have to change back +# fit$coefficients = fit$coefficients * -1 + +pred = terra::predict(object = ta, model = fit, fun = predict, + type = "response") + +# make the prediction "manually" +ta_2 = ta +newdata = as.data.frame(as.matrix(ta_2)) +colSums(is.na(newdata)) # ok, there are NAs +ind = rowSums(is.na(newdata)) == 0 +tmp = lrn_glm$predict_newdata(newdata = newdata[ind, ], task = task) +newdata[ind, "pred"] = as.data.table(tmp)[["prob.TRUE"]] +# check +all.equal(as.numeric(values(pred)), newdata$pred) # TRUE +pred_2 = ta$slope +pred_2[] = newdata$pred + +#********************************************************** +# 3.2 plot the prediction==== +#********************************************************** + +hs = terra::shade(ta$slope * pi / 180, + terra::terrain(ta$elev, v = "aspect", unit = "radians"), + 40, 270) +plot(hs, col = gray(seq(0, 1, length.out = 100)), legend = FALSE) +plot(pred, col = RColorBrewer::brewer.pal(name = "Reds", 9), add = TRUE) + +# or using tmap +# white raster to only plot the axis ticks, otherwise gridlines would be visible +study_mask = terra::vect(study_mask) +lsl_sf = st_as_sf(lsl, coords = c("x", "y"), crs = 32717) +rect = tmaptools::bb_poly(raster::raster(hs)) +bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.02, 1), ylim = c(-0.02, 1), + relative = TRUE) +tm_shape(hs, bbox = bbx) + + tm_grid(col = "black", n.x = 1, n.y = 1, labels.inside.frame = FALSE, + labels.rot = c(0, 90)) + + tm_raster(palette = "white", legend.show = FALSE) + + # hillshade + tm_shape(terra::mask(hs, study_mask), bbox = bbx) + + tm_raster(palette = gray(0:100 / 100), n = 100, + legend.show = FALSE) + + # prediction raster + tm_shape(terra::mask(pred, study_mask)) + + tm_raster(alpha = 0.5, palette = "Reds", n = 6, legend.show = TRUE, + title = "Susceptibility") + + # rectangle and outer margins + qtm(rect, fill = NULL) + + tm_layout(outer.margins = c(0.04, 0.04, 0.02, 0.02), frame = FALSE, + legend.position = c("left", "bottom"), + legend.title.size = 0.9) diff --git a/extdata/12-bmr_score.rds b/extdata/12-bmr_score.rds new file mode 100644 index 000000000..a0450cf10 Binary files /dev/null and b/extdata/12-bmr_score.rds differ diff --git a/extdata/12-sp_conv_cv.rds b/extdata/12-sp_conv_cv.rds deleted file mode 100644 index be6e1234d..000000000 Binary files a/extdata/12-sp_conv_cv.rds and /dev/null differ diff --git a/extdata/12-spatial_cv_result.rds b/extdata/12-spatial_cv_result.rds deleted file mode 100644 index d33a487ea..000000000 Binary files a/extdata/12-spatial_cv_result.rds and /dev/null differ diff --git a/geocompr.bib b/geocompr.bib index 1a0687533..57e3879a5 100644 --- a/geocompr.bib +++ b/geocompr.bib @@ -2177,11 +2177,20 @@ @article{schratz_mlr3spatiotempcv_2021 journal = {arXiv preprint arXiv:2110.12674} } -@book{becker_mlr3_2021, +@book{becker_mlr3_2022, title={mlr3 book}, author={Becker, M. and Binder, M. and Bischl, B. and Lang, M. and Pfisterer, F. and Reich, N.G. and Richter, J. and Schratz, P. and Sonabend, R.}, url={https://mlr3book.mlr-org.com}, - year={2021} + year={2022} } - +@article{lang_mlr3_2019, + title = {Mlr3: {{A}} Modern Object-Oriented Machine Learning Framework in {{R}}}, + shorttitle = {Mlr3}, + author = {Lang, Michel and Binder, Martin and Richter, Jakob and Schratz, Patrick and Pfisterer, Florian and Coors, Stefan and Au, Quay and Casalicchio, Giuseppe and Kotthoff, Lars and Bischl, Bernd}, + year = {2019}, + volume = {4}, + pages = {1903}, + journal = {Journal of Open Source Software}, + number = {44} +} \ No newline at end of file