Postmerge fixing Ecology chapter #776

Merged · 6 commits · Apr 19, 2022
Changes from 4 commits
16 changes: 8 additions & 8 deletions 15-eco.Rmd
@@ -31,7 +31,7 @@ To do so, we will bring together concepts presented in previous chapters and eve
Fog oases are one of the most fascinating vegetation formations we have ever encountered.
These formations, locally termed *lomas*, develop on mountains along the coastal deserts of Peru and Chile.^[Similar vegetation formations develop also in other parts of the world, e.g., in Namibia and along the coasts of Yemen and Oman [@galletti_land_2016].]
The deserts' extreme conditions and remoteness provide the habitat for a unique ecosystem, including species endemic to the fog oases.
- Despite the arid conditions and low levels of precipitation of around 30-50 mm per year on average, fog deposition increases the amount of water available to plants during austal winter.
+ Despite the arid conditions and low levels of precipitation of around 30-50 mm per year on average, fog deposition increases the amount of water available to plants during austral winter.
Collaborator: 👍 for typo fixes

This results in green southern-facing mountain slopes along the coastal strip of Peru (Figure \@ref(fig:study-area-mongon)).
This fog, which develops below the temperature inversion caused by the cold Humboldt current in austral winter, provides the name for this habitat.
Every few years, the El Niño phenomenon brings torrential rainfall to this sun-baked environment [@dillon_lomas_2003].
@@ -424,7 +424,7 @@ text(tree_mo, pretty = 0)
dev.off()
```

- ```{r tree, echo=FALSE, fig.cap="Simple example of a decision tree with three internal nodes and four terminal nodes.", fig.scap="Simple example of a decision tree."}
+ ```{r tree, echo=FALSE, fig.cap="Simple example of a decision tree with three internal nodes and four terminal nodes.", out.width="60%", fig.scap="Simple example of a decision tree."}
knitr::include_graphics("figures/15_tree.png")
```

@@ -486,7 +486,7 @@ task = mlr3spatiotempcv::TaskRegrST$new(

Using an `sf` object as the backend automatically provides the geometry information needed for the spatial partitioning later on.
Additionally, we got rid of the columns `id` and `spri` since these variables should not be used as predictors in the modeling.
- Next, we go on to contruct the a random forest\index{random forest} learner from the **ranger** package.
+ Next, we go on to construct a random forest\index{random forest} learner from the **ranger** package.

```{r 15-eco-21, eval=FALSE}
lrn_rf = lrn("regr.ranger", predict_type = "response")
@@ -519,7 +519,7 @@ search_space = paradox::ps(
Having defined the search space, we are all set for specifying our tuning via the `AutoTuner()` function.
Since we deal with geographic data, we will again make use of spatial cross-validation to tune the hyperparameters\index{hyperparameter} (see Sections \@ref(intro-cv) and \@ref(spatial-cv-with-mlr)).
Specifically, we will use a five-fold spatial partitioning with only one repetition (`rsmp()`).
- In each of these spatial partitions, we run 50 models (`trm()`) while using randomly selected hyperparameter configurations (`tnr`) within predefined limits (`seach_space`) to find the optimal hyperparameter\index{hyperparameter} combination.
+ In each of these spatial partitions, we run 50 models (`trm()`) while using randomly selected hyperparameter configurations (`tnr()`) within predefined limits (`search_space`) to find the optimal hyperparameter\index{hyperparameter} combination.
Collaborator: Definite improvement. A follow-on question: it would be worth explaining in more detail what these functions are; I'm new to them and am not sure from this good but terse description.

Collaborator (author): Good point; I will reference the spatial-cv chapter, where I explain in a little more detail how to construct an `AutoTuner()`.

The performance measure is the root mean squared error (RMSE\index{RMSE}).
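As a quick reminder of what this measure computes, here is a minimal base-R sketch of the RMSE on toy numbers (illustrative values only, not the chapter's model output):

```r
# RMSE on toy data: square the errors, average, take the root
obs  = c(0.2, 0.5, 0.9)  # hypothetical observed values
pred = c(0.3, 0.4, 1.0)  # hypothetical predictions
rmse = sqrt(mean((obs - pred)^2))
rmse
#> [1] 0.1
```

Lower values are better; the RMSE is expressed in the units of the response variable, which is why it is judged against the response's range further below.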

```{r 15-eco-22, eval=FALSE}
@@ -541,7 +541,7 @@ at = mlr3tuning::AutoTuner$new(
Calling the `train()`-method of the `AutoTuner`-object finally runs the hyperparameter\index{hyperparameter} tuning, and will find the optimal hyperparameter\index{hyperparameter} combination for the specified parameters.

```{r 14-eco-24, eval=FALSE}
- # hyperparamter tuning
+ # hyperparameter tuning
set.seed(0412022)
at$train(task)
#>...
@@ -562,12 +562,12 @@ saveRDS(at, "extdata/15-tune.rds")
```

```{r 15-eco-26, echo=FALSE, eval=FALSE}
- tune = readRDS("extdata/15-tune.rds")
+ at = readRDS("extdata/15-tune.rds")
Collaborator: What does `at` stand for? AutoTuner? It may be worth stating that somewhere, or using a longer and more descriptive object name.

Collaborator (author): Yes, `at` stands for AutoTuner, and I guess this was more obvious while creating the object: https://github.com/Robinlovelace/geocompr/blob/672991f23a7115b77c8ee199bdf8353b84f289f0/15-eco.Rmd#L526-L538
Still, the name could be more descriptive, something like `autotuner_rf`, where `rf` stands for random forest.

Collaborator:
> Still, the name could be more descriptive, something like `autotuner_rf`, where `rf` stands for random forest.

Agreed. I see now the test failures are unrelated to your changes, Jannes, so I suggest merging this now to keep the momentum. Many thanks!

```

An `mtry` of 4, a `sample.fraction` of 0.9, and a `min.node.size` of 7 represent the best hyperparameter\index{hyperparameter} combination.
An RMSE\index{RMSE} of
- <!-- `r # round(tune$tuning_result$regr.rmse, 2)` -->
+ <!-- `r # round(at$tuning_result$regr.rmse, 2)` -->
0.38
is relatively good when considering the range of the response variable which is
<!-- `r # round(diff(range(rp$sc)), 2)` -->
@@ -591,7 +591,7 @@ Given a multilayer `SpatRaster` containing rasters named as the predictors used
pred = terra::predict(ep, model = at, fun = predict)
```

- ```{r rf-pred, echo=FALSE, fig.cap="Predictive mapping of the floristic gradient clearly revealing distinct vegetation belts.", fig.width = 10, fig.height = 10, fig.scap="Predictive mapping of the floristic gradient."}
+ ```{r rf-pred, echo=FALSE, fig.cap="Predictive mapping of the floristic gradient clearly revealing distinct vegetation belts.", out.width="60%", fig.scap="Predictive mapping of the floristic gradient."}
# # restrict the prediction to your study area
# pred = terra::mask(pred, terra::vect(study_area)) |>
# terra::trim()
12 changes: 8 additions & 4 deletions _15-ex.Rmd
@@ -23,7 +23,7 @@ E1. Run a NMDS\index{NMDS} using the percentage data of the community matrix.
Report the stress value and compare it to the stress value as retrieved from the NMDS using presence-absence data.
What might explain the observed difference?

- ```{r 15-ex-e1, eval=FALSE}
+ ```{r 15-ex-e1, message=FALSE}
data("comm", package = "spDataLarge")
pa = decostand(comm, "pa")
pa = pa[rowSums(pa) != 0, ]
@@ -35,6 +35,7 @@ nmds_pa$stress
nmds_per$stress
```

+ ```{asis, message=FALSE}
The NMDS using the presence-absence values yields a better result (`nmds_pa$stress`) than the one using percentage data (`nmds_per$stress`).
This might seem surprising at first sight.
On the other hand, the percentage matrix contains both more information and more noise.
@@ -47,6 +48,7 @@ The point here is that percentage data as specified during a field campaign migh
This again introduces noise which in turn will worsen the ordination result.
Still, compared to presence-absence data alone, it is valuable information whether one species had a higher frequency or coverage in one plot than in another.
One compromise would be to use a categorical scale such as the Londo scale.
+ ```

E2. Compute all the predictor rasters\index{raster} we have used in the chapter (catchment slope, catchment area), and put them into a `SpatRaster`-object.
Add `dem` and `ndvi` to it.
@@ -55,7 +57,7 @@ Finally, construct a response-predictor matrix.
The scores of the first NMDS\index{NMDS} axis (which were the result when using the presence-absence community matrix) rotated in accordance with elevation represent the response variable, and should be joined to `random_points` (use an inner join).
To complete the response-predictor matrix, extract the values of the environmental predictor raster object to `random_points`.

- ```{r 15-ex-e2, eval=FALSE}
+ ```{r 15-ex-e2}
# first compute the terrain attributes we have also used in the chapter
library(dplyr)
library(terra)
@@ -118,7 +120,7 @@ Parallelize\index{parallelization} the tuning level (see Section \@ref(svm)).
Report the mean RMSE\index{RMSE} and use a boxplot to visualize all retrieved RMSEs.
Please note that this exercise is best solved using the mlr3 functions `benchmark_grid()` and `benchmark()` (see https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking for more information).

- ```{r 15-ex-e3, eval=FALSE}
+ ```{r 15-ex-e3, message=FALSE}
library(dplyr)
library(future)
library(mlr3)
@@ -190,7 +192,7 @@ tictoc::toc()
# stop parallelization
future:::ClusterRegistry("stop")
# save your result, e.g. to
- saveRDS(bmr, file = "extdata/15_bmr.rds")
+ # saveRDS(bmr, file = "extdata/15_bmr.rds")

# mean RMSE
bmr$aggregate(measures = msr("regr.rmse"))
@@ -213,5 +215,7 @@ ggplot(data = d, mapping = aes(x = learner_id, y = regr.rmse)) +
labs(y = "RMSE", x = "model")
```

+ ```{asis, message=FALSE}
In fact, `lm` performs at least as well as the random forest model, and thus should be preferred since it is much easier to understand and computationally much less demanding (no need for fitting hyperparameters).
But keep in mind that the dataset used here is small in terms of observations and predictors, and that the response-predictor relationships are also relatively linear.
+ ```
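As a minimal illustration of that simplicity (a hedged base-R sketch on toy data, not the exercise dataset): fitting a linear model is a single call with no counterpart to the random forest's `mtry`, `sample.fraction`, or `min.node.size`:

```r
# A linear model needs no hyperparameter tuning (toy data for illustration)
set.seed(1)
x = 1:20
y = 2 * x + rnorm(20)  # linear signal plus noise
fit = lm(y ~ x)        # one call, nothing to tune
coef(fit)[["x"]]       # slope estimate, close to the true value of 2
```

The random forest, by contrast, only reaches its potential after the spatially cross-validated tuning shown in the chapter.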