Review of ch12 #782

Merged (8 commits, Apr 24, 2022)
12-spatial-cv.Rmd: 95 changes (43 additions, 52 deletions)
@@ -30,9 +30,9 @@ Required data will be attached in due course.

Statistical learning\index{statistical learning} is concerned with the use of statistical and computational models for identifying patterns in data and predicting from these patterns.
Due to its origins, statistical learning\index{statistical learning} is one of R's\index{R} great strengths (see Section \@ref(software-for-geocomputation)).^[
Applying statistical techniques to geographic data has been an active topic of research for many decades in the fields of Geostatistics, Spatial Statistics and point pattern analysis [@diggle_modelbased_2007; @gelfand_handbook_2010; @baddeley_spatial_2015].
Applying statistical techniques to geographic data has been an active topic of research for many decades in the fields of geostatistics, spatial statistics and point pattern analysis [@diggle_modelbased_2007; @gelfand_handbook_2010; @baddeley_spatial_2015].
Collaborator: 👍

]
Statistical learning\index{statistical learning} combines methods from statistics\index{statistics} and machine learning\index{machine learning} and its methods can be categorized into supervised and unsupervised techniques.
Statistical learning\index{statistical learning} combines methods from statistics\index{statistics} and machine learning\index{machine learning} and can be categorized into supervised and unsupervised techniques.
Collaborator: One question: could we rename the chapter Geostatistical learning?

Collaborator: No, wouldn't do that, geostatistics is basically a field of its own and we are not doing geostatistics here

Both are increasingly used in disciplines ranging from physics, biology and ecology to geography and economics [@james_introduction_2013].

This chapter focuses on supervised techniques in which there is a training dataset, as opposed to unsupervised techniques such as clustering\index{clustering}.
@@ -79,7 +79,7 @@ data("lsl", "study_mask", package = "spDataLarge")
ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
```

This should load three objects: a `data.frame` named `lsl`, an `sf` object named `study_mask` and a `SpatRaster` (see Section \@ref(raster-classes)) named `ta` containing terrain attribute rasters.
The above code loads three objects: a `data.frame` named `lsl`, an `sf` object named `study_mask` and a `SpatRaster` (see Section \@ref(raster-classes)) named `ta` containing terrain attribute rasters.
Collaborator: 👍

`lsl` contains a factor column `lslpts` where `TRUE` corresponds to an observed landslide 'initiation point', with the coordinates stored in columns `x` and `y`.^[
The landslide initiation point is located in the scarp of a landslide polygon. See @muenchow_geomorphic_2012 for further details.
]
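
As a quick sanity check on the data described above, one might inspect the loaded objects along these lines (a minimal sketch; output not shown, and the object names are those loaded in the chunk above):

```r
# sketch: inspect the objects loaded above (requires the spDataLarge package)
data("lsl", package = "spDataLarge")
ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
class(lsl)          # a plain data.frame with coordinates and terrain attributes
table(lsl$lslpts)   # balance of FALSE (non-landslide) vs. TRUE (landslide) points
names(ta)           # names of the terrain attribute raster layers
```
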
@@ -90,28 +90,26 @@ The 175 non-landslide points were sampled randomly from the study area, with the
# library(tmap)
# data("lsl", package = "spDataLarge")
# ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
# lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = 32717)
# lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = "EPSG:32717")
Collaborator: 👍

# hs = terra::shade(slope = ta$slope * pi / 180,
# terra::terrain(ta$elev, v = "aspect", unit = "radians"))
# # so far tmaptools does not support terra objects
#
# rect = tmaptools::bb_poly(raster::raster(hs))
# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.0001, 1),
# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.0001, 1),
# ylim = c(-0.0001, 1), relative = TRUE)
#
# map = tm_shape(hs, bbox = bbx) +
# tm_grid(col = "black", n.x = 1, n.y = 1, labels.inside.frame = FALSE,
# labels.rot = c(0, 90)) +
# labels.rot = c(0, 90), lines = FALSE) +
# tm_raster(palette = gray(0:100 / 100), n = 100, legend.show = FALSE) +
# tm_shape(ta$elev) +
# tm_raster(alpha = 0.5, palette = terrain.colors(10), legend.show = FALSE) +
# tm_shape(lsl_sf) +
# tm_bubbles("lslpts", size = 0.2, palette = "-RdYlBu",
# tm_bubbles("lslpts", size = 0.2, palette = "-RdYlBu",
# title.col = "Landslide: ") +
# qtm(rect, fill = NULL) +
# tm_layout(outer.margins = c(0.04, 0.04, 0.02, 0.02), frame = FALSE) +
# tm_layout(inner.margins = 0) +
# tm_legend(bg.color = "white")
# tmap::tmap_save(map, filename = "figures/lsl-map-1.png", width = 11,
# tmap::tmap_save(map, filename = "figures/lsl-map-1.png", width = 11,
# height = 11, units = "cm")
knitr::include_graphics("figures/lsl-map-1.png")
```
@@ -121,21 +119,20 @@ The first three rows of `lsl`, rounded to two significant digits, can be found i

```{r lslsummary, echo=FALSE, warning=FALSE}
lsl_table = lsl |>
mutate(across(.cols = -any_of(c("x", "y", "lslpts")), ~signif(., 2))) |>
head(3)
knitr::kable(lsl_table, caption = "Structure of the lsl dataset.",
mutate(across(.cols = -any_of(c("x", "y", "lslpts")), ~signif(., 2)))
knitr::kable(lsl_table[c(1, 2, 350), ], caption = "Structure of the lsl dataset.",
caption.short = "`lsl` dataset.", booktabs = TRUE) |>
kableExtra::kable_styling(latex_options = "scale_down")
```

To model landslide susceptibility, we need some predictors.
Since terrain attributes are frequently associated with landsliding [@muenchow_geomorphic_2012], we have already extracted the following terrain attributes from `ta` to `lsl`:

- `slope`: slope angle (°).
- `cplan`: plan curvature (rad m^−1^) expressing the convergence or divergence of a slope and thus water flow.
- `cprof`: profile curvature (rad m^-1^) as a measure of flow acceleration, also known as downslope change in slope angle.
- `elev`: elevation (m a.s.l.) as the representation of different altitudinal zones of vegetation and precipitation in the study area.
- `log10_carea`: the decadic logarithm of the catchment area (log10 m^2^) representing the amount of water flowing towards a location.
- `slope` - slope angle (°)
Collaborator:
I prefer the previous style here, but without the full stops. So I would make it:

- `slope`: slope angle (°)

Member Author:
@Robinlovelace The new style is consistent with the style in chapters 1-8

Collaborator:
> @Robinlovelace The new style is consistent with the style in chapters 1-8

Which parts @Nowosad ? Had a quick look at the output below.

rg ' - ' *.Rmd
README.Rmd
5:<!-- README.md is generated from README.Rmd. Please edit that file - rmarkdown::render('README.Rmd', output_format = 'github_document', output_file = 'README.md') -->
55:<!-- - Next issue  -->

index.Rmd
11:  - geocompr.bib
12:  - packages.bib

_15-ex.Rmd
145:  mtry = paradox::p_int(lower = 1, upper = ncol(task$data()) - 1),

14-location.Rmd
332:    iter = iter - 1

15-eco.Rmd
65:For this, we will make use of a random forest model\index{random forest} - a very popular machine learning\index{machine learning} algorithm [@breiman_random_2001].
520:<!-- (`r # ncol(rp) - 1`), -->
526:  mtry = paradox::p_int(lower = 1, upper = ncol(task$data()) - 1),
650:all(values(pred - pred_2) == 0)

13-transport.Rmd
66:# code that generated the input data - see also ?bristol_ways
576:# online figure - backup
609:    - Bonus: find two ways of arriving at the same answer.
617:    - Bonus: what proportion of trips cross the proposed routes?
618:    - Advanced: write code that would increase this proportion.
633:    - Bonus: develop a raster layer that divides the Bristol region into 100 cells (10 by 10) and provide a metric related to transport policy, such as number of people trips that pass through each cell by walking or the average speed limit of roads, from the `bristol_ways` dataset (the approach taken in Chapter \@ref(location)).

_12-ex.Rmd
22:    - Slope
23:    - Plan curvature
24:    - Profile curvature
25:    - Catchment area

11-algorithms.Rmd
193:abs(T1[1, 1] * (T1[2, 2] - T1[3, 2]) +
194:  T1[2, 1] * (T1[3, 2] - T1[1, 2]) +
195:  T1[3, 1] * (T1[1, 2] - T1[2, 2]) ) / 2
213:i = 2:(nrow(poly_mat) - 2)
222:  abs(x[1, 1] * (x[2, 2] - x[3, 2]) +
223:        x[2, 1] * (x[3, 2] - x[1, 2]) +
224:        x[3, 1] * (x[1, 2] - x[2, 2]) ) / 2
294:    x[1, 1] * (x[2, 2] - x[3, 2]) +
295:    x[2, 1] * (x[3, 2] - x[1, 2]) +
296:    x[3, 1] * (x[1, 2] - x[2, 2])
327:  i = 2:(nrow(poly_mat) - 2)
343:  i = 2:(nrow(poly_mat) - 2)
412:    - Which of the best practices covered in Section \@ref(scripts) does it follow?
413:    - Create a version of the script on your computer in an IDE\index{IDE} such as RStudio\index{RStudio} (preferably by typing-out the script line-by-line, in your own coding style and with your own comments, rather than copy-pasting --- this will help you learn how to type scripts). Using the example of a square polygon (e.g., created with `poly_mat = cbind(x = c(0, 0, 9, 9, 0), y = c(0, 9, 9, 0, 0))`) execute the script line-by-line.
414:    - What changes could be made to the script to make it more reproducible?
415:    <!-- - Answer: The script could state that it needs a an object called `poly_mat` to be present and, if none is present, create an example dataset at the outset for testing. -->
417:<!--     - Try to reproduce the results: how many significant earthquakes were there last month? -->
418:<!--     - Modify the script so that it provides a map with all earthquakes that happened in the past hour. -->
421:    - How could the documentation be improved?
422:  <!-- It could document the source of the data better - e.g. with `data from https://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php` -->
424:    - Reproduce the results on your own computer with reference to the script `10-centroid-alg.R`, an implementation of this algorithm (bonus: type out the commands - try to avoid copy-pasting).
426:    - Are the results correct? Verify them by converting `poly_mat` into an `sfc` object (named `poly_sfc`) with `st_polygon()` (hint: this function takes objects of class `list()`) and then using `st_area()` and `st_centroid()`.
434:     - Bonus 1: Think about why the method only works for convex hulls and note changes that would need to be made to the algorithm to make it work for other types of polygon.
436:     - Bonus 2: Building on the contents of `10-centroid-alg.R`, write an algorithm\index{algorithm} only using base R functions that can find the total length of linestrings represented in matrix form.
439:    - Verify it works by running `poly_centroid_sf(sf::st_sf(sf::st_sfc(poly_sfc)))`
440:    - What error message do you get when you try to run `poly_centroid_sf(poly_mat)`?

10-gis.Rmd
223:In our case, three arguments seem important - `INPUT`, `OVERLAY`, and `OUTPUT`.
426:The U.S. Army - Construction Engineering Research Laboratory (USA-CERL) created the core of the Geographical Resources Analysis Support System (GRASS)\index{GRASS} [Table \@ref(tab:gis-comp); @neteler_open_2008] from 1982 to 1995. 
430:Here, we introduce **rgrass**\index{rgrass (package)} with one of the most interesting problems in GIScience - the traveling salesman problem\index{traveling salesman}.
434:In our case, the number of possible solutions correspond to `(25 - 1)! / 2`, i.e., the factorial of 24 divided by 2 (since we do not differentiate between forward or backward direction).
435:Even if one iteration can be done in a nanosecond, this still corresponds to `r format(factorial(25 - 1) / (2 * 10^9 * 3600 * 24 * 365))` years.
627:<!-- - We could have used GRASS's spatial database\index{spatial database} (based on SQLite) which allows faster processing.  -->
631:<!-- - We could have also accessed an already existing GRASS spatial database from within R. -->
635:<!-- - You can also start R from within a running GRASS\index{GRASS} session [for more information please refer to @bivand_applied_2013 and this [wiki](https://grasswiki.osgeo.org/wiki/R_statistics/rgrass7)]. -->
636:<!-- - Refer to the excellent [GRASS online help](https://grass.osgeo.org/grass77/manuals/) or `execGRASS("g.manual", flags = "i")` for more information on each available GRASS geoalgorithm\index{geoalgorithm}. -->
637:<!-- - If you would like to use GRASS 6 from within R, use the R package **spgrass6**. -->
651:<!-- source code (or docker) - https://github.com/jblindsay/whitebox-tools -->
734:#> Extent: (-180.000000, -89.900000) - (179.999990, 83.645130)
896:As a final note, if your data is getting too big for PostgreSQL/PostGIS and you require massive spatial data management and query performance, then the next logical step is to use large-scale geographic querying on distributed computing systems, as for example, provided by GeoMesa (http://www.geomesa.org/) or Apache Sedona [https://sedona.apache.org/; formermly known as GeoSpark - @huang_geospark_2017].
908:    - **RQGIS**, **RSAGA** and **rgrass7**
909:    - **sf**

08-read-write-plot.Rmd
97:<!-- - elevatr - https://github.com/jhollist/elevatr/issues/64 -->
101:<!-- - https://github.com/ErikKusch/KrigR -->
105:<!-- - https://github.com/ropensci/MODIStsp -->
213:<!-- rgee - see https://github.com/loreabad6/30DayMapChallenge/blob/main/scripts/day08_blue.R -->
223:<!-- potentially useful package - https://github.com/eblondel/geosapi -->
224:<!-- rstac - https://gist.github.com/h-a-graham/420434c158c139180f5eb82859099082, -->
413:<!-- - KEA - https://gdal.org/drivers/raster/kea.html -->
414:<!-- - sfarrow & geoparquet/pandas/GeoFeather -->
604:A KML file stores geographic information in XML format - a data format for the creation of web pages and the transfer of data in an application-independent way [@nolan_xml_2014].

09-mapping.Rmd
314:    rect(xleft = 0:(n - 1), ybottom = i - 1, xright = 1:n, ytop = i - 0.2,
317:  text(rep(-0.1, n_colors), (1: n_colors) - 0.6, labels = titles, xpd = TRUE, adj = 1)
344:```{r na-sb, message=FALSE, fig.cap="Map with additional elements - a north arrow and scale bar.", out.width="50%", fig.asp=1, fig.scap="Map with a north arrow and scale bar."}
481:```{r insetmap1, message=FALSE, fig.cap="Inset map providing a context - location of the central part of the Southern Alps in New Zealand.", fig.scap="Inset map providing a context."}
487:Inset map can be saved to file either by using a graphic device (see Section \@ref(visual-outputs)) or the `tmap_save()` function and its arguments - `insets_tm` and `insets_vp`.
809:  # abort old way of including - mixed content issues
1011:Additionally, it is possible to modify the `intermax` argument - maximum number of iterations for the cartogram transformation.
1080:    - Name two advantages of each based on the experience.
1081:    - Name three other mapping packages and an advantage of each.
1082:    - Bonus: create three more maps of Africa using these three packages.
1084:    - Bonus: improve the map aesthetics, for example by changing the legend title, class labels and color palette.
1089:    - Change the default colors to match your perception of the land cover categories
1090:    - Add a scale bar and north arrow and change the position of both to improve the map's aesthetic appeal
1091:    - Bonus: Add an inset map of Zion National Park's location in the context of the Utah state. (Hint: an object representing Utah can be subset from the `us_states` dataset.) 
1093:    - With one facet showing HDI and the other representing population growth (hint: using variables `HDI` and `pop_growth`, respectively)
1094:    - With a 'small multiple' per country
1097:    - Showing first the spatial distribution of HDI scores then population growth
1098:    - Showing each country in order
1100:    - With **tmap**
1101:    - With **mapview**
1102:    - With **leaflet**
1103:    - Bonus: For each approach, add a legend (if not automatically provided) and a scale bar
1105:    - In the city you live, for a couple of users per day
1106:    - In the country you live, for dozens of users per day
1107:    - Worldwide for hundreds of users per day and large data serving requirements
1109:    - Using `textInput()`
1110:    - Using `selectInput()`

_05-ex.Rmd
50:nrow(nz_height_near_cant) # 75 - 5 more
146:plot(srtm_resampl_all - srtm_resampl1, range = c(-300, 300))
147:plot(srtm_resampl_all - srtm_resampl2, range = c(-300, 300))
148:plot(srtm_resampl_all - srtm_resampl3, range = c(-300, 300))
149:plot(srtm_resampl_all - srtm_resampl4, range = c(-300, 300))
150:plot(srtm_resampl_all - srtm_resampl5, range = c(-300, 300))

05-geometry-operations.Rmd
235:To achieve that, each object is firstly shifted in a way that its center has coordinates of `0, 0` (`(nz_sfc - nz_centroid_sfc)`). 
241:nz_scale = (nz_sfc - nz_centroid_sfc) * 0.5 + nz_centroid_sfc
264:The `rotation` function accepts one argument `a` - a rotation angle in degrees.
269:nz_rotate = (nz_sfc - nz_centroid_sfc) * rotation(30) + nz_centroid_sfc
289:nz_scale_rotate = (nz_sfc - nz_centroid_sfc) * 0.25 * rotation(90) + nz_centroid_sfc
295:nz_shear = (nz_sfc - nz_centroid_sfc) * shearing(1.1, 0) + nz_centroid_sfc
776:- Nearest neighbor - assigns the value of the nearest cell of the original raster to the cell of the target one.
778:- Bilinear interpolation - assigns a weighted average of the four nearest cells from the original raster to the cell of the target one (Figure \@ref(fig:bilinear)). The fastest method for continuous rasters
779:- Cubic interpolation - uses values of 16 nearest cells of the original raster to determine the output cell value, applying third-order polynomial functions. Used for continuous rasters. It results in a more smoothed surface than the bilinear interpolation, but is also more computationally demanding
780:- Cubic spline interpolation - also uses values of 16 nearest cells of the original raster to determine the output cell value, but applies cubic splines (piecewise third-order polynomial functions) to derive the results. Used for continuous rasters
781:- Lanczos windowed sinc resampling - uses values of 36 nearest cells of the original raster to determine the output cell value. Used for continuous rasters^[More detailed explanation of this method can be found at https://gis.stackexchange.com/a/14361/20955.]
840:<!-- gdalUtils - https://cran.r-project.org/web/packages/gdalUtils/index.html - we mentioned it in geocompr 1; however it seems abandoned -->
841:<!-- gdalUtilities - https://cran.r-project.org/web/packages/gdalUtilities/index.html -->
842:<!-- also - add some reference to GDAL functions! -->
851:- `gdalinfo` - lists various information about a raster file, including its resolution, CRS, bounding box, and more
852:- `gdal_translate` - converts raster data between different file formats
853:- `gdal_rasterize` - converts vector data into raster files
854:- `gdalwarp` - allows for raster mosaicing, resampling, cropping, and reprojecting

_04-ex.Rmd
52:# Calculate n. points in each region - this contains the result
169:E7. Calculate the Normalized Difference Water Index	(NDWI; `(green - nir)/(green + nir)`) of a Landsat image. 
178:  (nir - red) / (nir + red)
184:    (green - nir) / (green + nir)
233:plot(distance_to_coast_km - distance_to_coast_km2)

04-spatial-operations.Rmd
1003:NDVI&= \frac{\text{NIR} - \text{Red}}{\text{NIR} + \text{Red}}\\
1014:The raster object has four satellite bands - blue, green, red, and near-infrared (NIR).
1019:  (nir - red) / (nir + red)
1064:```{r focal-example, echo = FALSE, fig.cap = "Input raster (left) and resulting output raster (right) due to a focal operation - finding the minimum value in 3-by-3 moving windows.", fig.scap="Illustration of a focal operation."}

03-attribute-operations.Rmd
550:Alternatively, we can use one of **dplyr** functions - `mutate()` or `transmute()`.

_02-ex.Rmd
16:# - Its geometry type?
18:# - The number of countries?
20:# - Its coordinate reference system (CRS)?
36:# - What does the `cex` argument do (see `?plot`)?
38:# - Why was `cex` set to the `sqrt(world$pop) / 10000`?
40:# - Bonus: experiment with different ways to visualize the global population.

01-introduction.Rmd
375:<!-- add info about specialized packages - sfnetworks, landscapemetrics, gdalcubes, rgee, etc. -->

02-spatial-data.Rmd
117:<!-- - [liblwgeom](https://github.com/postgis/postgis/tree/master/liblwgeom), a geometry engine used by PostGIS, via the [**lwgeom**](https://r-spatial.github.io/lwgeom/) package -->
423:# not printed - enough of these figures already (RL)
450:# Plotted - it is referenced in ch5 (st_cast)
1042:The degree of compression is often referred to as *flattening*, defined in terms of the equatorial radius ($a$) and polar radius ($b$) as follows: $f = (a - b) / a$. The terms *ellipticity* and *compression* can also be used.
1055:Both datums in Figure \@ref(fig:datum-fig) are put on top of a geoid - a model of global mean sea level.^[Please note that the geoid on the Figure exaggerates the bumpy surface of the geoid by a factor of 10,000 to highlight the irregular shape of the planet.]
1075:There are three main groups of projection types - conic, cylindrical, and planar (azimuthal).

_03-ex.Rmd
131:  mutate(pop_dens_diff_10_15 = pop_dens_15 - pop_dens_10,
136:E11. Change the columns' names in `us_states` to lowercase. (Hint: helper functions - `tolower()` and `colnames()` may help.)
144:The new object should have only two variables - `median_income_15` and `geometry`.
159:  mutate(pov_change = poverty_level_15 - poverty_level_10)
165:  mutate(pov_pct_change = pov_pct_15 - pov_pct_10)
196:r[c(1, 9, 81 - 9 + 1, 81)]

Member Author: I based it on the example from ch5.

Collaborator:

I think the style is fine but the dash in

  • slope - slope angle (°)

Does not seem standard. Same with

  • Nearest neighbor - assigns the value of the nearest cell of the original raster to the cell of the target one. It is fast and usually suitable for categorical rasters

Also if there are full stops (periods) in the bullet points there should be a full stop at the end: https://www.instructionalsolutions.com/blog/bulleted-list-punctuation

Regarding colons vs dashes, I think both

  • slope --- slope angle (°)

and

  • slope: slope angle (°)

Would be right with the former being an 'em dash'

Member Author:
@Robinlovelace I am fine with colons -- we just need to use them consistently.

- `cplan` - plan curvature (rad m^−1^) expressing the convergence or divergence of a slope and thus water flow
- `cprof` - profile curvature (rad m^-1^) as a measure of flow acceleration, also known as downslope change in slope angle
- `elev` - elevation (m a.s.l.) as the representation of different altitudinal zones of vegetation and precipitation in the study area
- `log10_carea` - the decadic logarithm of the catchment area (log10 m^2^) representing the amount of water flowing towards a location

It might be a worthwhile exercise to compute the terrain attributes with the help of R-GIS bridges (see Chapter \@ref(gis)) and extract them to the landslide points (see the Exercise section at the end of this chapter).

@@ -158,7 +155,7 @@ It is worth understanding each of the three input arguments:

- A formula, which specifies landslide occurrence (`lslpts`) as a function of the predictors
- A family, which specifies the type of model, in this case `binomial` because the response is binary (see `?family`)
- The data frame which contains the response and the predictors
- The data frame which contains the response and the predictors (as columns)

The results of this model can be printed as follows (`summary(fit)` provides a more detailed account of the results):
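
For context, the collapsed chunk above fits the model along these lines (a minimal sketch assuming the five predictors listed earlier; not necessarily the exact code hidden in this diff):

```r
# sketch: fit a GLM with a binomial family to the landslide data (`lsl` from above)
fit = glm(lslpts ~ slope + cplan + cprof + elev + log10_carea,
          family = binomial(), data = lsl)
summary(fit)  # coefficients, standard errors and significance codes
```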

@@ -179,7 +176,7 @@ head(pred_glm)

Spatial predictions can be made by applying the coefficients to the predictor rasters.
This can be done manually or with `terra::predict()`.
In addition to a model object (`fit`), this function also expects a `SpatRaster` with the predictors named as in the model's input data frame (Figure \@ref(fig:lsl-susc)).
In addition to a model object (`fit`), this function also expects a `SpatRaster` with the predictors (raster layers) named as in the model's input data frame (Figure \@ref(fig:lsl-susc)).

```{r 12-spatial-cv-9, eval=FALSE}
# making the prediction
@@ -191,16 +188,15 @@ pred = terra::predict(ta, model = fit, type = "response")
# data("lsl", "study_mask", package = "spDataLarge")
# ta = terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))
# study_mask = terra::vect(study_mask)
# # white raster to only plot the axis ticks, otherwise gridlines would be visible
# lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = 32717)
# hs = terra::shade(ta$slope * pi / 180,
# terra::terrain(ta$elev, v = "aspect", unit = "radians"))
# rect = tmaptools::bb_poly(raster::raster(hs))
# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.00001, 1),
# ylim = c(-0.00001, 1), relative = TRUE)
# bbx = tmaptools::bb(raster::raster(hs), xlim = c(-0.0001, 1),
# ylim = c(-0.0001, 1), relative = TRUE)
#
# map = tm_shape(hs, bbox = bbx) +
# tm_grid(col = "black", n.x = 1, n.y = 1, labels.inside.frame = FALSE,
# labels.rot = c(0, 90)) +
# labels.rot = c(0, 90), lines = FALSE) +
# tm_raster(palette = "white", legend.show = FALSE) +
# # hillshade
# tm_shape(terra::mask(hs, study_mask), bbox = bbx) +
@@ -210,12 +206,10 @@ pred = terra::predict(ta, model = fit, type = "response")
# tm_shape(terra::mask(pred, study_mask)) +
# tm_raster(alpha = 0.5, palette = "Reds", n = 6, legend.show = TRUE,
# title = "Susceptibility") +
# # rectangle and outer margins
# qtm(rect, fill = NULL) +
# tm_layout(outer.margins = c(0.04, 0.04, 0.02, 0.02), frame = FALSE,
# legend.position = c("left", "bottom"),
# legend.title.size = 0.9)
# tmap::tmap_save(map, filename = "figures/lsl-susc-1.png", width = 11,
# tm_layout(legend.position = c("left", "bottom"),
# legend.title.size = 0.9,
# inner.margins = 0)
# tmap::tmap_save(map, filename = "figures/lsl-susc-1.png", width = 11,
# height = 11, units = "cm")
knitr::include_graphics("figures/lsl-susc-1.png")
```
@@ -313,12 +307,12 @@ Third, the **resampling** approach assesses the predictive performance of the mo
To implement a GLM\index{GLM} in **mlr3**\index{mlr3 (package)}, we must create a **task** containing the landslide data.
Since the response is binary (two-category variable) and has a spatial dimension, we create a classification\index{classification} task with `TaskClassifST$new()` of the **mlr3spatiotempcv** package [@schratz_mlr3spatiotempcv_2021, for non-spatial tasks, use `mlr3::TaskClassif$new()` or `mlr3::TaskRegr$new()` for regression\index{regression} tasks, see `?Task` for other task types].^[The **mlr3** ecosystem makes heavy use of **data.table** and **R6** classes. Though you can use **mlr3** without knowing the specifics of **data.table** or **R6**, some familiarity with them is rather helpful. To learn more about **data.table**, please refer to https://rdatatable.gitlab.io/data.table/index.html. To learn more about **R6**, we recommend [Chapter 14](https://adv-r.hadley.nz/fp.html) of the Advanced R book [@wickham_advanced_2019].]
The first essential argument of these `Task*$new()` functions is `backend`.
`backend` expects the data to be used for the modeling including the response and predictor variables.
`backend` expects that the input data includes the response and predictor variables.
The `target` argument indicates the name of a response variable (in our case this is `lslpts`) and `positive` determines which of the two factor levels of the response variable indicate the landslide initiation point (in our case this is `TRUE`).
All other variables of the `lsl` dataset will serve as predictors.
For spatial CV, we need to provide a few extra arguments (`extra_args`).
The `coordinate_names` argument expects the names of the coordinate columns (see Section \@ref(intro-cv) and Figure \@ref(fig:partitioning)).
Additionally, one should indicate the used CRS (`crs`) and if one wishes to use the coordinates as predictors in the modeling (`coords_as_features`).
Additionally, we should indicate the used CRS (`crs`) and decide if we want to use the coordinates as predictors in the modeling (`coords_as_features`).

```{r 12-spatial-cv-11, eval=FALSE}
# create task
@@ -330,17 +324,16 @@ task = mlr3spatiotempcv::TaskClassifST$new(
extra_args = list(
coordinate_names = c("x", "y"),
coords_as_features = FALSE,
crs = 32717)
crs = "EPSG:32717")
)
```

Note that `TaskClassifST$new()` also accepts an `sf`-object as input for the `backend` parameter.
In this case, you might only want to specify the `coords_as_features` argument of the `extra_args` list.
We did not convert `lsl` into an `sf`-object because `TaskClassifST$new()` just converts it back into a non-spatial `data.table` object in the background.
We did not convert `lsl` into an `sf`-object because `TaskClassifST$new()` would just turn it back into a non-spatial `data.table` object in the background.
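
Purely as an illustration of the sf route described above, a hedged sketch (the object names `lsl_sf` and `task_sf` are hypothetical; only `coords_as_features` is passed because the coordinates and CRS would come from the geometry column):

```r
# sketch: build the same task from an sf object rather than a data.frame
lsl_sf = sf::st_as_sf(lsl, coords = c("x", "y"), crs = "EPSG:32717")
task_sf = mlr3spatiotempcv::TaskClassifST$new(
  id = "lsl_sf", backend = lsl_sf, target = "lslpts", positive = "TRUE",
  extra_args = list(coords_as_features = FALSE)
)
```
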
For a short data exploration, the `autoplot()` function of the **mlr3viz** package might come in handy since it plots the response against all predictors and all predictors against all predictors (not shown).

```{r autoplot, eval=FALSE}
library(mlr3viz)
# plot response against each predictor
mlr3viz::autoplot(task, type = "duo")
# plot all variables against each other
@@ -353,7 +346,6 @@ All classification\index{classification} **learners** start with `classif.` and
To find out about learners that are able to model a binary response variable, we can run:

```{r 12-spatial-cv-12, eval=FALSE}
library(mlr3extralearners)
mlr3extralearners::list_mlr3learners(
filter = list(class = "classif", properties = "twoclass"),
select = c("id", "mlr3_package", "required_packages")) |>
@@ -384,7 +376,6 @@ We opt for the binomial classification\index{classification} method used in Sect
Additionally, we need to specify the `predict_type`, which determines the type of the prediction: `prob` results in the predicted probability of landslide occurrence between 0 and 1 (this corresponds to `type = "response"` in `predict.glm()`).

```{r 12-spatial-cv-13, eval=FALSE}
library(mlr3learners)
learner = mlr3::lrn("classif.log_reg", predict_type = "prob")
```

@@ -468,6 +459,7 @@ mean(score_spcv_glm$classif.auc) |>
```

To put these results in perspective, let us compare them with AUROC\index{AUROC} values from a 100-repeated 5-fold non-spatial cross-validation (Figure \@ref(fig:boxplot-cv); the code for the non-spatial cross-validation\index{cross-validation} is not shown here but will be explored in the exercise section).
<!--JN: why "as expected"? I think it would be great to explain this expectation in a few sentences here...-->
As expected, the spatially cross-validated result yields lower AUROC values on average than the conventional cross-validation approach, underlining the over-optimistic predictive performance due to spatial autocorrelation\index{autocorrelation!spatial} of the latter.
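
To make the contrast concrete, a minimal sketch of how the two resampling strategies might be specified in **mlr3** (resampling keys assumed from **mlr3** and **mlr3spatiotempcv**; the full benchmark code is left to the exercises):

```r
library(mlr3)
library(mlr3spatiotempcv)  # registers the spatial resampling methods
# spatial CV: partitions built by k-means clustering of the coordinates
resampling_sp  = rsmp("repeated_spcv_coords", folds = 5, repeats = 100)
# conventional CV: random partitions that ignore the spatial dimension
resampling_nsp = rsmp("repeated_cv", folds = 5, repeats = 100)
```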

```{r boxplot-cv, echo=FALSE, out.width="75%", fig.cap="Boxplot showing the difference in GLM AUROC values on spatial and conventional 100-repeated 5-fold cross-validation.", fig.scap="Boxplot showing AUROC values."}
@@ -499,7 +491,7 @@ Random forest\index{random forest} models might be more popular than SVMs; howev
Since (spatial) hyperparameter tuning is the major aim of this section, we will use an SVM.
For those wishing to apply a random forest model, we recommend reading this chapter and then proceeding to Chapter \@ref(eco), in which we will apply the concepts and techniques covered here to make spatial predictions based on a random forest model.

SVMs\index{SVM} search for the best possible 'hyperplanes' to separate classes (in a classification\index{classification} case) and estimate 'kernels' with specific hyperparameters to allow for non-linear boundaries between classes [@james_introduction_2013].
SVMs\index{SVM} search for the best possible 'hyperplanes' to separate classes (in a classification\index{classification} case) and estimate 'kernels' with specific hyperparameters to create non-linear boundaries between classes [@james_introduction_2013].
Hyperparameters\index{hyperparameter} should not be confused with coefficients of parametric models, which are sometimes also referred to as parameters.^[
For a detailed description of the difference between coefficients and hyperparameters, see the 'machine mastery' blog post on the subject.
<!-- For a more detailed description of the difference between coefficients and hyperparameters, see the [machine mastery blog](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/). -->
@@ -516,9 +508,9 @@ The classification\index{classification} task remains the same, hence we can sim
Learners implementing SVM can be found using `list_mlr3learners()` as follows:

```{r 12-spatial-cv-23}
mlr3extralearners::list_mlr3learners() %>%
.[class == "classif" & grepl("svm", id),
.(id, class, mlr3_package, required_packages)]
mlr3_learners = list_mlr3learners()
mlr3_learners[class == "classif" & grepl("svm", id),
.(id, class, mlr3_package, required_packages)]
```

Of the options illustrated above, we will use `ksvm()` from the **kernlab** package [@karatzoglou_kernlab_2004].
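
A minimal sketch of how this learner could be constructed (assuming the `classif.ksvm` learner provided via **mlr3extralearners**; the kernel hyperparameters themselves are tuned further below):

```r
library(mlr3)
library(mlr3extralearners)  # provides the kernlab-based learner
# sketch: SVM classifier with a radial basis kernel, returning class probabilities
lrn_ksvm = lrn("classif.ksvm", predict_type = "prob",
               kernel = "rbfdot", type = "C-svc")
```
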
@@ -574,14 +566,13 @@ talk in person (see also exercises):
-->

```{r 12-spatial-cv-26, eval=FALSE}
library("mlr3tuning")
# five spatially disjoint partitions
tune_level = mlr3::rsmp("spcv_coords", folds = 5)
# use 50 randomly selected hyperparameters
terminator = mlr3tuning::trm("evals", n_evals = 50)
tuner = mlr3tuning::tnr("random_search")
# define the outer limits of the randomly selected hyperparameters
seach_space = paradox::ps(
search_space = paradox::ps(
C = paradox::p_dbl(lower = -12, upper = 15, trafo = function(x) 2^x),
sigma = paradox::p_dbl(lower = -15, upper = 6, trafo = function(x) 2^x)
)
@@ -601,17 +592,17 @@ at_ksvm = mlr3tuning::AutoTuner$new(
```

The tuning is now set up to fit 250 models to determine optimal hyperparameters for one fold.
Repeating this for each fold, we end up with 1250 (250 \* 5) models for each repetition.
Repeating this for each fold, we end up with 1,250 (250 \* 5) models for each repetition.
Repeated 100 times means fitting a total of 125,000 models to identify optimal hyperparameters (Figure \@ref(fig:partitioning)).
These are used in the performance estimation, which requires the fitting of another 500 models (5 folds \* 100 repetitions; see Figure \@ref(fig:partitioning)).
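
The model counts quoted above follow directly from the tuning set-up; a short sketch of the arithmetic:

```r
# sketch: number of models fitted during tuning and performance estimation
n_evals = 50; inner_folds = 5; outer_folds = 5; repetitions = 100
n_evals * inner_folds                              # 250 models per outer fold
n_evals * inner_folds * outer_folds                # 1250 models per repetition
n_evals * inner_folds * outer_folds * repetitions  # 125,000 tuning models in total
outer_folds * repetitions                          # plus 500 models for performance estimation
```
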
To make the performance estimation processing chain even clearer, let us write down the commands we have given to the computer:

1. Performance level (upper left part of Figure \@ref(fig:inner-outer)): split the dataset into five spatially disjoint (outer) subfolds.
1. Tuning level (lower left part of Figure \@ref(fig:inner-outer)): use the first fold of the performance level and split it again spatially into five (inner) subfolds for the hyperparameter tuning.
Use the 50 randomly selected hyperparameters\index{hyperparameter} in each of these inner subfolds, i.e., fit 250 models.
1. Performance estimation: Use the best hyperparameter combination from the previous step (tuning level) and apply it to the first outer fold in the performance level to estimate the performance (AUROC\index{AUROC}).
1. Repeat steps 2 and 3 for the remaining four outer folds.
1. Repeat steps 2 to 4, 100 times.
1. Performance level (upper left part of Figure \@ref(fig:inner-outer)) - split the dataset into five spatially disjoint (outer) subfolds
1. Tuning level (lower left part of Figure \@ref(fig:inner-outer)) - use the first fold of the performance level and split it again spatially into five (inner) subfolds for the hyperparameter tuning.
Use the 50 randomly selected hyperparameters\index{hyperparameter} in each of these inner subfolds, i.e., fit 250 models
1. Performance estimation - Use the best hyperparameter combination from the previous step (tuning level) and apply it to the first outer fold in the performance level to estimate the performance (AUROC\index{AUROC})
1. Repeat steps 2 and 3 for the remaining four outer folds
1. Repeat steps 2 to 4, 100 times

The process of hyperparameter tuning and performance estimation is computationally intensive.
To decrease model runtime, **mlr3** offers the possibility to use parallelization\index{parallelization} with the help of the **future** package.
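
A hedged sketch of what such a parallel set-up might look like (the exact plan used in the book is not shown in this diff; the worker count here is an assumption):

```r
# sketch: run the resampling across local R processes with the future package
library(future)
plan("multisession", workers = floor(availableCores() / 2))
```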