Review of ch12 #782
Changes from all commits
6916cb6
3a7a011
64adfd4
09314a3
4bddbdc
5ea0419
cec166a
b56b773
@@ -30,9 +30,9 @@ Required data will be attached in due course.
 Statistical learning\index{statistical learning} is concerned with the use of statistical and computational models for identifying patterns in data and predicting from these patterns.
 Due to its origins, statistical learning\index{statistical learning} is one of R's\index{R} great strengths (see Section \@ref(software-for-geocomputation)).^[
-Applying statistical techniques to geographic data has been an active topic of research for many decades in the fields of Geostatistics, Spatial Statistics and point pattern analysis [@diggle_modelbased_2007; @gelfand_handbook_2010; @baddeley_spatial_2015].
+Applying statistical techniques to geographic data has been an active topic of research for many decades in the fields of geostatistics, spatial statistics and point pattern analysis [@diggle_modelbased_2007; @gelfand_handbook_2010; @baddeley_spatial_2015].
 ]
-Statistical learning\index{statistical learning} combines methods from statistics\index{statistics} and machine learning\index{machine learning} and its methods can be categorized into supervised and unsupervised techniques.
+Statistical learning\index{statistical learning} combines methods from statistics\index{statistics} and machine learning\index{machine learning} and can be categorized into supervised and unsupervised techniques.

Review comment: One question: could we rename the chapter Geostatistical learning?

Review comment: No, wouldn't do that, geostatistics is basically a field of its own and we are not doing geostatistics here.

 Both are increasingly used in disciplines ranging from physics, biology and ecology to geography and economics [@james_introduction_2013].
 This chapter focuses on supervised techniques in which there is a training dataset, as opposed to unsupervised techniques such as clustering\index{clustering}.
@@ -79,7 +79,7 @@ data("lsl", "study_mask", package = "spDataLarge")
 ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
 ```
-This should load three objects: a `data.frame` named `lsl`, an `sf` object named `study_mask` and a `SpatRaster` (see Section \@ref(raster-classes)) named `ta` containing terrain attribute rasters.
+The above code loads three objects: a `data.frame` named `lsl`, an `sf` object named `study_mask` and a `SpatRaster` (see Section \@ref(raster-classes)) named `ta` containing terrain attribute rasters.

Review comment: 👍

 `lsl` contains a factor column `lslpts` where `TRUE` corresponds to an observed landslide 'initiation point', with the coordinates stored in columns `x` and `y`.^[
 The landslide initiation point is located in the scarp of a landslide polygon. See @muenchow_geomorphic_2012 for further details.
 ]
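As a quick check that the chunk above behaves as described (a hedged sketch; it assumes the **spDataLarge** package is installed and the object names follow the chapter), one could inspect the loaded objects:

```r
# Sanity check of the loaded objects; exact output depends on the
# installed spDataLarge version
data("lsl", "study_mask", package = "spDataLarge")
ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
class(lsl)         # should be a data.frame
class(study_mask)  # should be an sf object
names(ta)          # terrain attribute layers stored in the SpatRaster
table(lsl$lslpts)  # balance of landslide vs. non-landslide points
```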
@@ -90,28 +90,26 @@ The 175 non-landslide points were sampled randomly from the study area, with the
 # library(tmap)
 # data("lsl", package = "spDataLarge")
 # ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
-# lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = 32717)
+# lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = "EPSG:32717")

Review comment: 👍

 # hs = terra::shade(slope = ta$slope * pi / 180,
 # terra::terrain(ta$elev, v = "aspect", unit = "radians"))
 # # so far tmaptools does not support terra objects
 # rect = tmaptools::bb_poly(raster::raster(hs))
-# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.0001, 1),
+# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.0001, 1),
 # ylim = c(-0.0001, 1), relative = TRUE)
 # map = tm_shape(hs, bbox = bbx) +
 # tm_grid(col = "black", n.x = 1, n.y = 1, labels.inside.frame = FALSE,
-# labels.rot = c(0, 90)) +
+# labels.rot = c(0, 90), lines = FALSE) +
 # tm_raster(palette = gray(0:100 / 100), n = 100, legend.show = FALSE) +
 # tm_shape(ta$elev) +
 # tm_raster(alpha = 0.5, palette = terrain.colors(10), legend.show = FALSE) +
 # tm_shape(lsl_sf) +
-# tm_bubbles("lslpts", size = 0.2, palette = "-RdYlBu",
+# tm_bubbles("lslpts", size = 0.2, palette = "-RdYlBu",
 # title.col = "Landslide: ") +
 # qtm(rect, fill = NULL) +
-# tm_layout(outer.margins = c(0.04, 0.04, 0.02, 0.02), frame = FALSE) +
+# tm_layout(inner.margins = 0) +
 # tm_legend(bg.color = "white")
-# tmap::tmap_save(map, filename = "figures/lsl-map-1.png", width = 11,
+# tmap::tmap_save(map, filename = "figures/lsl-map-1.png", width = 11,
 # height = 11, units = "cm")
 knitr::include_graphics("figures/lsl-map-1.png")
 ```
@@ -121,21 +119,20 @@ The first three rows of `lsl`, rounded to two significant digits, can be found i
 ```{r lslsummary, echo=FALSE, warning=FALSE}
 lsl_table = lsl |>
-  mutate(across(.cols = -any_of(c("x", "y", "lslpts")), ~signif(., 2))) |>
-  head(3)
-knitr::kable(lsl_table, caption = "Structure of the lsl dataset.",
+  mutate(across(.cols = -any_of(c("x", "y", "lslpts")), ~signif(., 2)))
+knitr::kable(lsl_table[c(1, 2, 350), ], caption = "Structure of the lsl dataset.",
              caption.short = "`lsl` dataset.", booktabs = TRUE) |>
   kableExtra::kable_styling(latex_options = "scale_down")
 ```
 To model landslide susceptibility, we need some predictors.
 Since terrain attributes are frequently associated with landsliding [@muenchow_geomorphic_2012], we have already extracted the following terrain attributes from `ta` to `lsl`:

-- `slope`: slope angle (°).
-- `cplan`: plan curvature (rad m^−1^) expressing the convergence or divergence of a slope and thus water flow.
-- `cprof`: profile curvature (rad m^-1^) as a measure of flow acceleration, also known as downslope change in slope angle.
-- `elev`: elevation (m a.s.l.) as the representation of different altitudinal zones of vegetation and precipitation in the study area.
-- `log10_carea`: the decadic logarithm of the catchment area (log10 m^2^) representing the amount of water flowing towards a location.
+- `slope` - slope angle (°)
+- `cplan` - plan curvature (rad m^−1^) expressing the convergence or divergence of a slope and thus water flow
+- `cprof` - profile curvature (rad m^-1^) as a measure of flow acceleration, also known as downslope change in slope angle
+- `elev` - elevation (m a.s.l.) as the representation of different altitudinal zones of vegetation and precipitation in the study area
+- `log10_carea` - the decadic logarithm of the catchment area (log10 m^2^) representing the amount of water flowing towards a location

Review comment: I prefer the previous style here, but without the full stops.

Review comment: @Robinlovelace The new style is consistent with the style in chapters 1-8.

Review comment: Which parts @Nowosad? Had a quick look at the output below.

Review comment: I based it on the example from ch5.

Review comment: I think the style is fine but the dash does not seem standard. Also, if there are full stops (periods) in the bullet points there should be one at the end: https://www.instructionalsolutions.com/blog/bulleted-list-punctuation. Regarding colons vs dashes, I think both would be right, with the former being an 'em dash'.

Review comment: @Robinlovelace I am fine with colons -- we just need to use them consistently.

 It might be a worthwhile exercise to compute the terrain attributes with the help of R-GIS bridges (see Chapter \@ref(gis)) and extract them to the landslide points (see the Exercise section at the end of this chapter).
@@ -158,7 +155,7 @@ It is worth understanding each of the three input arguments:

 - A formula, which specifies landslide occurrence (`lslpts`) as a function of the predictors
 - A family, which specifies the type of model, in this case `binomial` because the response is binary (see `?family`)
-- The data frame which contains the response and the predictors
+- The data frame which contains the response and the predictors (as columns)

 The results of this model can be printed as follows (`summary(fit)` provides a more detailed account of the results):
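The fitting call itself falls outside this hunk; a sketch consistent with the three arguments listed above (formula, family, data frame) might look like the following, where the predictor set matches the terrain attributes described earlier and `fit` is an assumed object name:

```r
# Hedged sketch of the GLM fit; predictor names follow the lsl dataset
fit = glm(lslpts ~ slope + cplan + cprof + elev + log10_carea,
          family = binomial(),
          data = lsl)
summary(fit)  # detailed account of the fitted coefficients
```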
@@ -179,7 +176,7 @@ head(pred_glm)

 Spatial predictions can be made by applying the coefficients to the predictor rasters.
 This can be done manually or with `terra::predict()`.
-In addition to a model object (`fit`), this function also expects a `SpatRaster` with the predictors named as in the model's input data frame (Figure \@ref(fig:lsl-susc)).
+In addition to a model object (`fit`), this function also expects a `SpatRaster` with the predictors (raster layers) named as in the model's input data frame (Figure \@ref(fig:lsl-susc)).

 ```{r 12-spatial-cv-9, eval=FALSE}
 # making the prediction
@@ -191,16 +188,15 @@ pred = terra::predict(ta, model = fit, type = "response")
 # data("lsl", "study_mask", package = "spDataLarge")
 # ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
 # study_mask = terra::vect(study_mask)
 # # white raster to only plot the axis ticks, otherwise gridlines would be visible
 # lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = 32717)
 # hs = terra::shade(ta$slope * pi / 180,
 # terra::terrain(ta$elev, v = "aspect", unit = "radians"))
 # rect = tmaptools::bb_poly(raster::raster(hs))
-# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.00001, 1),
-# ylim = c(-0.00001, 1), relative = TRUE)
+# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.0001, 1),
+# ylim = c(-0.0001, 1), relative = TRUE)
 # map = tm_shape(hs, bbox = bbx) +
 # tm_grid(col = "black", n.x = 1, n.y = 1, labels.inside.frame = FALSE,
-# labels.rot = c(0, 90)) +
+# labels.rot = c(0, 90), lines = FALSE) +
 # tm_raster(palette = "white", legend.show = FALSE) +
 # # hillshade
 # tm_shape(terra::mask(hs, study_mask), bbox = bbx) +
@@ -210,12 +206,10 @@ pred = terra::predict(ta, model = fit, type = "response")
 # tm_shape(terra::mask(pred, study_mask)) +
 # tm_raster(alpha = 0.5, palette = "Reds", n = 6, legend.show = TRUE,
 # title = "Susceptibility") +
-# # rectangle and outer margins
-# qtm(rect, fill = NULL) +
-# tm_layout(outer.margins = c(0.04, 0.04, 0.02, 0.02), frame = FALSE,
-# legend.position = c("left", "bottom"),
-# legend.title.size = 0.9)
-# tmap::tmap_save(map, filename = "figures/lsl-susc-1.png", width = 11,
+# tm_layout(legend.position = c("left", "bottom"),
+# legend.title.size = 0.9,
+# inner.margins = 0)
+# tmap::tmap_save(map, filename = "figures/lsl-susc-1.png", width = 11,
 # height = 11, units = "cm")
 knitr::include_graphics("figures/lsl-susc-1.png")
 ```
@@ -313,12 +307,12 @@ Third, the **resampling** approach assesses the predictive performance of the mo
 To implement a GLM\index{GLM} in **mlr3**\index{mlr3 (package)}, we must create a **task** containing the landslide data.
 Since the response is binary (two-category variable) and has a spatial dimension, we create a classification\index{classification} task with `TaskClassifST$new()` of the **mlr3spatiotempcv** package [@schratz_mlr3spatiotempcv_2021, for non-spatial tasks, use `mlr3::TaskClassif$new()` or `mlr3::TaskRegr$new()` for regression\index{regression} tasks, see `?Task` for other task types].^[The **mlr3** ecosystem makes heavy use of **data.table** and **R6** classes. Although you can use **mlr3** without knowing the specifics of **data.table** or **R6**, some familiarity with them helps. To learn more about **data.table**, please refer to https://rdatatable.gitlab.io/data.table/index.html. To learn more about **R6**, we recommend [Chapter 14](https://adv-r.hadley.nz/fp.html) of the Advanced R book [@wickham_advanced_2019].]
 The first essential argument of these `Task*$new()` functions is `backend`.
-`backend` expects the data to be used for the modeling including the response and predictor variables.
+`backend` expects that the input data includes the response and predictor variables.
 The `target` argument indicates the name of a response variable (in our case this is `lslpts`) and `positive` determines which of the two factor levels of the response variable indicate the landslide initiation point (in our case this is `TRUE`).
 All other variables of the `lsl` dataset will serve as predictors.
 For spatial CV, we need to provide a few extra arguments (`extra_args`).
 The `coordinate_names` argument expects the names of the coordinate columns (see Section \@ref(intro-cv) and Figure \@ref(fig:partitioning)).
-Additionally, one should indicate the used CRS (`crs`) and if one wishes to use the coordinates as predictors in the modeling (`coords_as_features`).
+Additionally, we should indicate the used CRS (`crs`) and decide if we want to use the coordinates as predictors in the modeling (`coords_as_features`).

 ```{r 12-spatial-cv-11, eval=FALSE}
 # create task
@@ -330,17 +324,16 @@ task = mlr3spatiotempcv::TaskClassifST$new(
   extra_args = list(
     coordinate_names = c("x", "y"),
     coords_as_features = FALSE,
-    crs = 32717)
+    crs = "EPSG:32717")
 )
 ```

 Note that `TaskClassifST$new()` also accepts an `sf`-object as input for the `backend` parameter.
 In this case, you might only want to specify the `coords_as_features` argument of the `extra_args` list.
-We did not convert `lsl` into an `sf`-object because `TaskClassifST$new()` just converts it back into a non-spatial `data.table` object in the background.
+We did not convert `lsl` into an `sf`-object because `TaskClassifST$new()` would just turn it back into a non-spatial `data.table` object in the background.
 For a short data exploration, the `autoplot()` function of the **mlr3viz** package might come in handy since it plots the response against all predictors and all predictors against all predictors (not shown).

 ```{r autoplot, eval=FALSE}
 library(mlr3viz)
 # plot response against each predictor
 mlr3viz::autoplot(task, type = "duo")
 # plot all variables against each other
@@ -353,7 +346,6 @@ All classification\index{classification} **learners** start with `classif.` and
 To find out about learners that are able to model a binary response variable, we can run:

 ```{r 12-spatial-cv-12, eval=FALSE}
 library(mlr3extralearners)
 mlr3extralearners::list_mlr3learners(
   filter = list(class = "classif", properties = "twoclass"),
   select = c("id", "mlr3_package", "required_packages")) |>
@@ -384,7 +376,6 @@ We opt for the binomial classification\index{classification} method used in Sect
 Additionally, we need to specify the `predict.type` which determines the type of the prediction with `prob` resulting in the predicted probability for landslide occurrence between 0 and 1 (this corresponds to `type = response` in `predict.glm`).

 ```{r 12-spatial-cv-13, eval=FALSE}
 library(mlr3learners)
 learner = mlr3::lrn("classif.log_reg", predict_type = "prob")
 ```
@@ -468,6 +459,7 @@ mean(score_spcv_glm$classif.auc) |>
 ```

 To put these results in perspective, let us compare them with AUROC\index{AUROC} values from a 100-repeated 5-fold non-spatial cross-validation (Figure \@ref(fig:boxplot-cv); the code for the non-spatial cross-validation\index{cross-validation} is not shown here but will be explored in the exercise section).
+<!--JN: why "as expected"? I think it would be great to explain this expectation in a few sentences here...-->
 As expected, the spatially cross-validated result yields lower AUROC values on average than the conventional cross-validation approach, underlining the over-optimistic predictive performance due to spatial autocorrelation\index{autocorrelation!spatial} of the latter.

 ```{r boxplot-cv, echo=FALSE, out.width="75%", fig.cap="Boxplot showing the difference in GLM AUROC values on spatial and conventional 100-repeated 5-fold cross-validation.", fig.scap="Boxplot showing AUROC values."}
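The non-spatial comparison is not shown in the diff; a hedged sketch of the two resampling set-ups, using the resampling ids provided by **mlr3** and **mlr3spatiotempcv** (the object names are illustrative), could be:

```r
# Spatial vs. conventional resampling objects for the 100-repeated
# 5-fold comparison described above
library(mlr3)
library(mlr3spatiotempcv)
rsmp_sp    = rsmp("repeated_spcv_coords", folds = 5, repeats = 100)
rsmp_nonsp = rsmp("repeated_cv", folds = 5, repeats = 100)
```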
@@ -499,7 +491,7 @@ Random forest\index{random forest} models might be more popular than SVMs; howev
 Since (spatial) hyperparameter tuning is the major aim of this section, we will use an SVM.
 For those wishing to apply a random forest model, we recommend reading this chapter and then proceeding to Chapter \@ref(eco), in which we apply the concepts and techniques covered here to make spatial predictions based on a random forest model.

-SVMs\index{SVM} search for the best possible 'hyperplanes' to separate classes (in a classification\index{classification} case) and estimate 'kernels' with specific hyperparameters to allow for non-linear boundaries between classes [@james_introduction_2013].
+SVMs\index{SVM} search for the best possible 'hyperplanes' to separate classes (in a classification\index{classification} case) and estimate 'kernels' with specific hyperparameters to create non-linear boundaries between classes [@james_introduction_2013].
 Hyperparameters\index{hyperparameter} should not be confused with coefficients of parametric models, which are sometimes also referred to as parameters.^[
 For a detailed description of the difference between coefficients and hyperparameters, see the 'machine mastery' blog post on the subject.
 <!-- For a more detailed description of the difference between coefficients and hyperparameters, see the [machine mastery blog](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/). -->
@@ -516,9 +508,9 @@ The classification\index{classification} task remains the same, hence we can sim
 Learners implementing SVM can be found using `listLearners()` as follows:

 ```{r 12-spatial-cv-23}
-mlr3extralearners::list_mlr3learners() %>%
-  .[class == "classif" & grepl("svm", id),
-    .(id, class, mlr3_package, required_packages)]
+mlr3_learners = list_mlr3learners()
+mlr3_learners[class == "classif" & grepl("svm", id),
+              .(id, class, mlr3_package, required_packages)]
 ```

 Of the options illustrated above, we will use `ksvm()` from the **kernlab** package [@karatzoglou_kernlab_2004].
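The learner construction itself is outside this hunk; based on the `classif.ksvm` learner id surfaced by the listing above, a sketch could look like the following (the `kernel` and `type` values are assumptions matching kernlab's C-classification with an RBF kernel, and `lrn_ksvm` is an illustrative name):

```r
# Hedged sketch of the kernlab-based SVM learner
library(mlr3extralearners)
lrn_ksvm = mlr3::lrn("classif.ksvm", predict_type = "prob",
                     kernel = "rbfdot", type = "C-svc")
```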
@@ -574,14 +566,13 @@ talk in person (see also exercises):
 -->

 ```{r 12-spatial-cv-26, eval=FALSE}
 library("mlr3tuning")
 # five spatially disjoint partitions
 tune_level = mlr3::rsmp("spcv_coords", folds = 5)
 # use 50 randomly selected hyperparameters
 terminator = mlr3tuning::trm("evals", n_evals = 50)
 tuner = mlr3tuning::tnr("random_search")
 # define the outer limits of the randomly selected hyperparameters
-seach_space = paradox::ps(
+search_space = paradox::ps(
   C = paradox::p_dbl(lower = -12, upper = 15, trafo = function(x) 2^x),
   sigma = paradox::p_dbl(lower = -15, upper = 6, trafo = function(x) 2^x)
 )
@@ -601,17 +592,17 @@ at_ksvm = mlr3tuning::AutoTuner$new(
 ```

 The tuning is now set-up to fit 250 models to determine optimal hyperparameters for one fold.
-Repeating this for each fold, we end up with 1250 (250 \* 5) models for each repetition.
+Repeating this for each fold, we end up with 1,250 (250 \* 5) models for each repetition.
 Repeated 100 times means fitting a total of 125,000 models to identify optimal hyperparameters (Figure \@ref(fig:partitioning)).
 These are used in the performance estimation, which requires the fitting of another 500 models (5 folds \* 100 repetitions; see Figure \@ref(fig:partitioning)).
 To make the performance estimation processing chain even clearer, let us write down the commands we have given to the computer:

-1. Performance level (upper left part of Figure \@ref(fig:inner-outer)): split the dataset into five spatially disjoint (outer) subfolds.
-1. Tuning level (lower left part of Figure \@ref(fig:inner-outer)): use the first fold of the performance level and split it again spatially into five (inner) subfolds for the hyperparameter tuning.
-   Use the 50 randomly selected hyperparameters\index{hyperparameter} in each of these inner subfolds, i.e., fit 250 models.
-1. Performance estimation: Use the best hyperparameter combination from the previous step (tuning level) and apply it to the first outer fold in the performance level to estimate the performance (AUROC\index{AUROC}).
-1. Repeat steps 2 and 3 for the remaining four outer folds.
-1. Repeat steps 2 to 4, 100 times.
+1. Performance level (upper left part of Figure \@ref(fig:inner-outer)) - split the dataset into five spatially disjoint (outer) subfolds
+1. Tuning level (lower left part of Figure \@ref(fig:inner-outer)) - use the first fold of the performance level and split it again spatially into five (inner) subfolds for the hyperparameter tuning.
+   Use the 50 randomly selected hyperparameters\index{hyperparameter} in each of these inner subfolds, i.e., fit 250 models
+1. Performance estimation - use the best hyperparameter combination from the previous step (tuning level) and apply it to the first outer fold in the performance level to estimate the performance (AUROC\index{AUROC})
+1. Repeat steps 2 and 3 for the remaining four outer folds
+1. Repeat steps 2 to 4, 100 times

 The process of hyperparameter tuning and performance estimation is computationally intensive.
 To decrease model runtime, **mlr3** offers the possibility to use parallelization\index{parallelization} with the help of the **future** package.
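The model counts quoted in the hunk above can be verified with a few lines of arithmetic (variable names are illustrative):

```r
# Model counts for the nested spatial CV described above
per_outer_fold  = 50 * 5                 # 50 hyperparameter draws x 5 inner folds = 250
per_repetition  = per_outer_fold * 5     # x 5 outer folds = 1250
tuning_total    = per_repetition * 100   # x 100 repetitions = 125000
performance_fit = 5 * 100                # 5 outer folds x 100 repetitions = 500
```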
Review comment: 👍