Merge pull request #141 from padasch/main
Multiple improvements from reports grading and github issues
stineb committed Jul 25, 2023
2 parents 5915520 + ad09696, commit ca27842
Showing 21 changed files with 1,527 additions and 175 deletions.
9 changes: 8 additions & 1 deletion .gitignore
@@ -9,4 +9,11 @@ _book
_bookdown_files

# data which shouldn't be tracked
-data/ddf.csv
+data/ddf.csv
+data/co2_mm_mlo.csv
+data/daily_fluxes.csv
+data/ddf_allsites_nested_joined.rds
+data/solutions/rf_mod_gridsearch.rds
+data/solutions/rf_mod.rds
+
+data/tutorials/rf_mod.rds
9 changes: 5 additions & 4 deletions 01-getting_started.Rmd
@@ -572,12 +572,13 @@ To install the `netCDF` command-line tools, follow these instructions:
- Restart RStudio if it was open during the steps above. Enter `install.packages("ncdf4")` and see if it installs correctly.
- If the installation failed, there should be a message "Installation of package ... had non-zero exit status". If so, check with the teaching assistants.

-> Note: *Terminal* allows you to interact with your Mac through the command line. You can open it through the Finder if you go to **Applications > Utilities > Terminal**.
+> Note: The program *Terminal* allows you to interact with your Mac through the command line and is installed automatically. You can open it through the Finder if you go to **Applications > Utilities > Terminal**.
- For MacOS users, via *MacPorts*:
-- Install *xcode* via Terminal by typing in `xcode-select --install` (if not installed already).
-- Check your OS version (Apple icon in the menu bar, then choose **About this Mac**; the **macOS** version should be displayed). Click on the respective version for MacPorts [here](https://www.macports.org/install.php) and run the downloaded `.pkg` file.
-- Install *netcdf* using `sudo port install netcdf` in the Terminal, as explained [here](https://ports.macports.org/port/netcdf/).
+- Install *xcode* via the Terminal by typing in `xcode-select --install` (if not installed already).
+- Then, install the package manager _Homebrew_ via the terminal command `/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"`.
+- Then, install _netcdf_ via the terminal command `brew install netcdf`.
+- It is possible that you still cannot install {terra} because you are missing _gdal_. If so, run `brew install gdal` in the terminal.

- For Linux users:
- Since the package is pre-installed in Linux, just type `sudo apt install gdal-bin libgdal-dev` in the terminal.
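The `apt` command above installs the GDAL tools. If you also need the netCDF command-line utilities themselves, the usual Debian/Ubuntu packages are sketched below (package names are an assumption for Debian-based distributions):

``` bash
# netCDF command-line tools and development headers on Debian/Ubuntu
sudo apt install netcdf-bin libnetcdf-dev
```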
91 changes: 52 additions & 39 deletions 03-data_wrangling.Rmd

Large diffs are not rendered by default.

25 changes: 16 additions & 9 deletions 04-data_vis.Rmd
@@ -588,22 +588,29 @@ This exercise explores the longest available atmospheric CO$_2$ record, obtained
- Write a function that computes a 12-month running mean of the CO$_2$ time series. The running mean for month $m$ should consider values of $m-5$ to $m+6$. Define arguments for the function that let the user specify the width of the running mean "box" (i.e., setting the $5$ and $6$ to any other integer of choice).
- Make a publication-ready figure that shows the monthly and the 12-month running mean time series of the CO$_2$ record.

-> *Hint*: You don't need to clean the .txt file, find a suitable function in R.
+> Hint: You don't need to clean the .txt file by hand; find a suitable function in R.
-> *Hint*: Arguments to your function may be a vector of the original (monthly) data and a parameter defining the number of elements over which the mean is to be taken.
+> Hint: Arguments to your function may be a vector of the original (monthly) data and a parameter defining the number of elements over which the mean is to be taken.
-> *Hint*: To automatically render the time axis with ggplot, you can create a time object by combining the year and month columns: `lubridate::ymd(paste(as.character(year), "-", as.character(month), "-15"))`
+> Hint: To automatically render the time axis with ggplot, you can create a time object by combining the year and month columns: `lubridate::ymd(paste(as.character(year), "-", as.character(month), "-15"))`
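A minimal sketch of what such a function and the date construction could look like (`df`, `year`, `month`, and `co2_monthly` are hypothetical names for the data you read in; this is one possible shape, not the reference solution):

```r
# Running mean over a box of `before` + `after` + 1 months (defaults: m-5 to m+6)
running_mean <- function(x, before = 5, after = 6) {
  sapply(seq_along(x), function(m) {
    idx <- max(1, m - before):min(length(x), m + after)
    mean(x[idx], na.rm = TRUE)
  })
}

# Hypothetical data frame and column names -- adjust to the file you read in
df |>
  dplyr::mutate(
    date = lubridate::ymd(paste(as.character(year), "-", as.character(month), "-15")),
    co2_runmean = running_mean(co2_monthly)
  )
```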
## Report Exercises

### Telling a story from data {-}

-In the previous exercises and tutorials, you have learned how to wrangle data, fit simple linear regression models and identify outliers, create figures for temporal patterns, and develop and test hypotheses. Use these skills to tell a story about the `airquality` dataset (directly available in R, just type `datasets::airquality` into the console).
+In the previous exercises and tutorials, you have learned how to wrangle data, fit simple linear regression models and identify outliers, create figures for temporal patterns, and develop and test hypotheses. Use these skills to analyze the `airquality` dataset (directly available in R, just type `datasets::airquality` into the console). The target variable of this dataset is the ozone concentration, and your task is to tell a story about it. Look at the other variables in the dataset and get creative! Think of what patterns and relationships could be interesting to talk about. Your report must include the following elements:

-Your solution has to include:
- A description of the `airquality` dataset (where is it from, what are the variables' units, etc.).
- A specific question that you want to answer through analyzing the data.
-- At least three statistical metrics from your dataset that aid you in answering your question (e.g., mean values, ranges, etc.).
-- At least three publishable figures or tables that show important relationships that aid you in answering your question (e.g., outliers, temporal patterns, scatterplots, etc.).
-- Make sure to interpret and discuss your results and hypotheses. Why were you right / why were you wrong?

+- At least three publishable figures that show important patterns (e.g., outliers, temporal patterns, scatterplots of correlations, etc.)
+- A description of the data that includes at least three statistical metrics that are relevant for the problem (argue for why you chose these metrics).
+- Interpretation and discussion of your results and hypotheses. The text alone should not exceed one A4 page (max. 400 words).
+> Important: The text alone should not exceed one A4 page (max. 400 words).
+> Hint: Metrics, figures, tables, etc. without a written-out explanation of what they show do not count.
+<!-- > Hint: Knit your file and check the html-version. Is it a readable report? If not, make it prettier and more readable! -->
+> Hint: To get more background information on the data, use the help functionalities in RStudio.
-> *Hint*: To get more background information on the data, use the help functionalities in RStudio.
### Deliverables for the report {.unnumbered}

Following the same requirements as mentioned in \@ref(retidy), present your solutions in a file called `re_airquality.Rmd`, save it in your `vignettes` folder alongside the HTML version, and make sure that your code is reproducible (make sure your .rmd is knittable, that all data is available, that paths to that data work, etc.).
10 changes: 8 additions & 2 deletions 07-code_management.Rmd
@@ -165,7 +165,7 @@ You can create a local copy of your remote repository that's hosted on GitHub us
``` bash
# create a local copy of the remote github repository
-git clone https://github.com/khufkens/YOUR_PROJECT.git
+git clone git@github.com:khufkens/YOUR_PROJECT.git
```
You can then start working on this repository by using the modify -\> stage -\> commit -\> push workflow.
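As a minimal sketch of that workflow (the file name is a placeholder, and this assumes your default branch is called `main`):

``` bash
git status                          # inspect which files were modified
git add vignettes/my_analysis.Rmd   # stage the change
git commit -m "describe the change" # commit the staged change
git push origin main                # push the commit to GitHub
```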
@@ -238,7 +238,7 @@ Create a new R project using the *git* R project template shown above.
### Collaborative Work on Github {-}
-This is a team exercise, so team up with someone else in the classroom. You will learn about how to collaborate online using *git* and Github. **Note** that this is part of your final performance assessment. In your report, you have to provide the links to the two repositories on your GitHub account (see exercise instructions below): One to your own repository that your friend forked, and one to the repository that you forked from your friend. We will check each repositories' commit history to see whether this pair-coding exercise was done correctly. Therefore, follow these steps precisely:
+This is a team exercise, so team up with someone else in the classroom. You will learn how to collaborate online using *git* and Github.
> Important: When creating your repositories, make sure that you set the repository to be public and not private.
@@ -259,3 +259,9 @@ This is a team exercise, so team up with someone else in the classroom. You will
- *Voluntary:* Can you force a merge conflict, for example by editing the same file at once, and resolve it? (See the sketch below.)
To complete the exercise, reverse roles between Person 1 and Person 2.
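For the voluntary merge-conflict item above, one possible sequence is sketched here (file and branch names are placeholders):

``` bash
# Both people edit the same line of the same file and commit.
# The second push is rejected, so pull first -- this triggers the conflict:
git pull origin main
# git inserts <<<<<<< / ======= / >>>>>>> markers into the conflicted file.
# Edit the file, keep the desired lines, delete the markers, then:
git add conflicted_file.Rmd
git commit -m "resolve merge conflict"
git push origin main
```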
+### Deliverables for the report {.unnumbered}
+This pair-coding exercise is part of your final performance assessment. We will check each repository's commit history to see whether the exercise was done correctly. So, follow the steps above precisely!
+When you submit your report by mail at the end of the course, you have to provide the links to your GitHub account, to your report repository that holds all other report exercises, and to the two repositories that you created during this pair-coding exercise (your repository that your friend forked and the repository that you forked from your friend). Alternatively, you can also create a `./vignettes/re_paircoding.Rmd` in your report repository, where you provide these links.
33 changes: 20 additions & 13 deletions 09-supervised_ml_I.Rmd
@@ -205,8 +204,7 @@ daily_fluxes <- read_csv("./data/FLX_CH-Dav_FLUXNET2015_FULLSET_DD_1997-2014_1-3
dplyr::mutate(TIMESTAMP = lubridate::ymd(TIMESTAMP)) |>
# set all -9999 to NA
-  dplyr::na_if(-9999) |> # NOTE: Newer tidyverse version no longer support this statement
-  # instead, use `mutate(across(where(is.numeric), ~na_if(., -9999))) |> `
+  mutate(across(where(is.numeric), ~na_if(., -9999))) |>
# retain only data based on >=80% good-quality measurements
# overwrite bad data with NA (not dropping rows)
@@ -332,8 +331,8 @@ The [{recipes}](https://recipes.tidymodels.org/) package provides an even more p
```{r warning=FALSE, message=FALSE}
pp <- recipes::recipe(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F, data = daily_fluxes_train) |>
-  recipes::step_center(all_numeric(), -all_outcomes()) |>
-  recipes::step_scale(all_numeric(), -all_outcomes())
+  recipes::step_center(recipes::all_numeric(), -recipes::all_outcomes()) |>
+  recipes::step_scale(recipes::all_numeric(), -recipes::all_outcomes())
```
The first line with the `recipe()` function call assigns *roles* to the different variables. `GPP_NT_VUT_REF` is an *outcome* (in "{recipes} speak"). Then, we used selectors to apply the recipe step to several variables at once. The first selector, `all_numeric()`, selects all variables that are either integers or real values. The second selector, `-all_outcomes()`, removes any outcome (target) variables from this recipe step. The returned object `pp` does *not* contain a normalized version of the data frame `daily_fluxes_train`, but rather the information that allows us to apply a specific set of pre-processing steps also to any other data set.
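To make this last point concrete, here is a minimal sketch of how the recipe is typically estimated on the training data and then applied to other data, mirroring the `prep()`/`bake()` calls used further below (the name `daily_fluxes_test_norm` is made up):

```r
# Estimate the centering and scaling parameters from the training data ...
prep_pp <- recipes::prep(pp, training = daily_fluxes_train)

# ... and apply exactly these parameters to any other data set
daily_fluxes_test_norm <- recipes::bake(prep_pp, new_data = daily_fluxes_test)
```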
@@ -532,14 +531,14 @@ $$
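For reference, the standard form of the Box-Cox transformation of a strictly positive value $y$ is:

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\log y, & \lambda = 0
\end{cases}
$$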
$\lambda$ is treated as a parameter that is fitted such that the resulting distribution of values $Y$ approaches the normal distribution. To specify a Box-Cox-transformation as part of the pre-processing, we can use `step_BoxCox()` from the {recipes} package.
```{r warning=FALSE, message=FALSE}
-pp <- recipe(WS_F ~ ., data = daily_fluxes_train) |>
-  step_BoxCox(all_outcomes())
+pp <- recipes::recipe(WS_F ~ ., data = daily_fluxes_train) |>
+  recipes::step_BoxCox(all_outcomes())
```
What do the transformed values look like?
```{r warning=FALSE, message=FALSE}
-prep_pp <- prep(pp, training = daily_fluxes_train |> drop_na())
+prep_pp <- recipes::prep(pp, training = daily_fluxes_train |> drop_na())
daily_fluxes_baked <- bake(prep_pp, new_data = daily_fluxes_test |> drop_na())
daily_fluxes_baked |>
ggplot(aes(x = WS_F, y = ..density..)) +
@@ -550,8 +549,8 @@ daily_fluxes_baked |>
Note that the Box-Cox-transformation can only be applied to values that are strictly positive. In our example, wind speed (`WS_F`) is. If this is not satisfied, a Yeo-Johnson transformation can be applied.
```{r warning=FALSE, message=FALSE}
-recipe(WS_F ~ ., data = daily_fluxes) |>
-  step_YeoJohnson(all_outcomes())
+recipes::recipe(WS_F ~ ., data = daily_fluxes) |>
+  recipes::step_YeoJohnson(all_outcomes())
```
### Putting it all together (half-way)
@@ -588,9 +587,9 @@ daily_fluxes_test <- rsample::testing(split)
# Model and pre-processing formulation, use all variables but LW_IN_F
pp <- recipes::recipe(GPP_NT_VUT_REF ~ SW_IN_F + VPD_F + TA_F,
data = daily_fluxes_train |> drop_na()) |>
-  recipes::step_BoxCox(all_predictors()) |>
-  recipes::step_center(all_numeric(), -all_outcomes()) |>
-  recipes::step_scale(all_numeric(), -all_outcomes())
+  recipes::step_BoxCox(recipes::all_predictors()) |>
+  recipes::step_center(recipes::all_numeric(), -recipes::all_outcomes()) |>
+  recipes::step_scale(recipes::all_numeric(), -recipes::all_outcomes())
# Fit linear regression model
mod_lm <- caret::train(
@@ -693,7 +692,7 @@ There are no exercises with provided solutions for this Chapter.
The figures above show the evaluation of the model performance of the linear regression and the KNN model, evaluated on the training and the test set. This exercise is about interpreting and understanding the observed differences. Work through the following points:
-1. Adopt the code from this Chapter for fitting and evaluating the linear regression model and the KNN into your own RMarkdown notebook. Name the file `./vignettes/re_ml_01.Rmd`. Keep larger functions in a separate file in an appropriate directory and load the function definition as part of the RMarkdown notebook.
+1. Adapt the code from this Chapter for fitting and evaluating the linear regression model and the KNN in your own RMarkdown file. Name the file `./vignettes/re_ml_01.Rmd`. Keep larger functions in a separate file in an appropriate directory and load the function definition as part of the RMarkdown.
2. Interpret observed differences in the context of the bias-variance trade-off:
- Why is the difference between the evaluation on the training and the test set larger for the KNN model than for the linear regression model?
- Why does the evaluation on the test set indicate a better model performance of the KNN model than the linear regression model?
@@ -713,6 +712,10 @@ Let's look at the role of $k$ in a KNN. Answer the following questions:
Add code and text for addressing this exercise to the file `./vignettes/re_ml_01.Rmd` and give the notebook a suitable structure for easy navigation with a table of contents (`toc`) by modifying its YAML header:
+> **Important:** To find an optimal $k$, you will have to use daily data and not half-hourly data!
+> Hint: Do not produce a whole series of the "Training - Test" figures shown above to find an optimal $k$; find a plot that shows the optimal $k$ directly (maybe you can find one in this or another Chapter...). A sketch follows the YAML example below.
``` bash
---
title: "Report Exercise Chapter 10"
@@ -722,3 +725,7 @@ output:
toc: true
---
```
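Regarding the hint on finding an optimal $k$: a minimal sketch using {caret}'s built-in cross-validation, assuming the recipe `pp` and the data `daily_fluxes_train` from this Chapter (the candidate values for $k$ are arbitrary):

```r
set.seed(1982)
mod_knn <- caret::train(
  pp,                                  # recipe defined earlier in this Chapter
  data = daily_fluxes_train |> tidyr::drop_na(),
  method = "knn",
  trControl = caret::trainControl(method = "cv", number = 10),
  tuneGrid = data.frame(k = c(2, 5, 10, 15, 20, 25, 30, 40, 60, 100)),
  metric = "MAE"
)

# Plots the cross-validated MAE against k; the minimum marks the optimal k
ggplot2::ggplot(mod_knn)
```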
+### Deliverables for the report {.unnumbered}
+Present your solutions in a file called `re_ml_01.Rmd`, save it in your `vignettes` folder alongside the HTML version, and make sure that your code is reproducible (make sure your .rmd is knittable, that all data is available, that paths to that data work, etc.).