-
Notifications
You must be signed in to change notification settings - Fork 3
/
vignette_cv_stacking_indica.Rmd
198 lines (147 loc) · 9.03 KB
/
vignette_cv_stacking_indica.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
---
title: "Evaluation of the model performance of a stacked model using CV0 on a rice dataset for phenotypic prediction"
author: "Cathy Westhues"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
link-citations: true
bibliography: REFERENCES.bib
vignette: >
%\VignetteIndexEntry{Evaluation of the model performance of a stacked model using CV0 on a rice dataset for phenotypic prediction}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
nocite: '@*'
---
```{r global_options, include = FALSE}
knitr::opts_chunk$set(comment = "#", collapse = TRUE)
```
# Step 1: Specifying input data and processing parameters
First, we create an object of class \code{METData} with the function [`create_METData()`](https://cathyjubin.github.io/learnMET/reference/create_METData.html).
The user must provide as input data genotypic and phenotypic data, as well as basic information about the field experiments (e.g. longitude, latitude data at least), and possibly environmental covariates (if available). These input data are checked and warning messages are given as output if the data are not correctly formatted. \
In this example, we use an indica rice dataset from Monteverde et al. (2019), which is implemented in the package as a "toy dataset".
From this study, a multi-year dataset of rice trials containing phenotypic data (four traits), genotypic and environmental data for a panel of indica genotypes across three years in a single location is available (more information on the datasets with `?pheno_indica`,`?geno_indica`,`?map_indica`,`?climate_variables_indica`,`?info_environments_indica`).\
In this case, environmental covariates by growth stage are directly available and can be used in predictions. These data should be provided as input in `create_METData()` using the argument *climate_variables*.
Hence, there is no need to retrieve with the package any daily weather data (hence *compute_climatic_ECs* argument set as FALSE).\
Do not forget to indicate where the plots of clustering analyses should be saved using the argument *path_to_save*.
```{r, message=FALSE, warning=FALSE}
library(learnMET)
data("geno_indica")
data("map_indica")
data("pheno_indica")
data("info_environments_indica")
data("climate_variables_indica")
METdata_indica <-
learnMET::create_METData(
geno = geno_indica,
pheno = pheno_indica,
climate_variables = climate_variables_indica,
compute_climatic_ECs = F,
info_environments = info_environments_indica,
map = map_indica,
path_to_save = '~/learnMET_analyses/indica/stacking'
)
```
The function `print.summary.METData()` gives an overview of the MET data created.\
```{r}
summary(METdata_indica)
```
# Step 2: Cross-validated model evaluation
## Specifying processing parameters and the type of prediction model
The goal of `predict_trait_MET_cv()` is to assess a given prediction method using a certain type of cross-validation (CV) scenario on the complete training set. The CV schemes covered by the package correspond to those generally evaluated in related literature on MET (@jarquin2014reaction; @jarquin2017increasing;@costa2021nonlinear).\
Here, we will use the CV0: predicting the performance of genotypes in untested environment(s), i.e. no phenotypic observations from these environment(s) included in the training set, and more specifically, the leave-one-year-out scenario. \
The function `predict_trait_MET_cv()` also allows to specify a specific subset of environmental variables from the METData$env_data object to be used in model fitting and predictions via the argument `list_env_predictors`.
**How does `predict_trait_MET_cv()` works?**
When `predict_trait_MET_cv()` is executed, a list of training/test splits is constructed according to the CV scheme chosen by the user. Each training set in each sub-element of this list is processed (e.g. standardization and removal of predictors with null variance.
The function applies a nested CV to obtain an unbiased generalization performance estimate, implying an inner loop CV nested in an outer CV. The inner loop is used for model selection, i.e. hyperparameter tuning with Bayesian optimization, while the outer loop is responsible for evaluation of model performance. Additionnally, there is a possibility for the user to specify a seed to allow reproducibility of analyses. If not provided, a random one is generated and provided in the results file.\
In `predict_trait_MET_cv()`, predictive ability is always calculated within the same environment (location–year combination), regardless of how the test sets are defined according to the different CV schemes.
Let's use a meta-learner to predict a phenotypic trait!\
```{r, echo = FALSE}
knitr::include_graphics("metalearner_cv0.png", dpi = 150)
```
Model stacking corresponds to an ensemble method which exploits many well-working models on a classification or regression task by learning how to combine outputs of these base learners to create a new model. Combined models can be very different: for instance, K-nearest neighbors, support vector machines and random forest models, trained on resamples with different hyperparameters can be used jointly in a stacked generalization ensemble.\
In the example below, we run a cross-validation of type CV0 with this method:
```{r eval=FALSE}
rescv0_1 <- predict_trait_MET_cv(
METData = METdata_indica,
trait = 'GC',
prediction_method = 'stacking_reg_1',
use_selected_markers = F,
lat_lon_included = F,
year_included = F,
cv_type = 'cv0',
cv0_type = 'leave-one-year-out',
kernel_G = 'linear',
include_env_predictors = T,
save_processing = T,
seed = 100,
path_folder = '~/INDICA/stacking/cv0'
)
```
Note that many methods for processing data based on user-defined parameters and machine learning-based methods are using functions from the tidymodels (https://www.tidymodels.org/) collection of R packages (@tidymodels). We use here in particular the package **stacks** (https://stacks.tidymodels.org/index.html), incorporated in the tidymodels
framework to build these stacking models.
## Extraction of results from the output
Let's have a look now at the structure of the output **rescv0_1**.
```{r include = FALSE}
rescv0_1 <- readRDS('met_cv.RDS')
```
The object of class `r class(rescv0_1)` contains 3 elements: `r names(rescv0_1)[1]`, `r names(rescv0_1)[2]` and `r names(rescv0_1)[3]`.\
### list_results_cv
The first element `r names(rescv0_1)[1]` is a list of `r class(rescv0_1$list_results_cv[[1]])` objects, and has a length equal to the number of training/test sets partitions, which is determined by the cross-validation scenario.\
Above, we ran `predict_trait_MET_cv()` with a leave-one-year-out CV scheme, and the dataset contains three years of
observations. Therefore, we expect the length of \code{rescv0_1$list_results_cv} to be equal to 3:
```{r}
cat(length(rescv0_1$list_results_cv))
cat(class(rescv0_1$list_results_cv[[1]]))
```
Within each of this `r class(rescv0_1$list_results_cv[[1]])` object, 9 elements are provided:
```{r, echo=FALSE}
cat(paste(names(rescv0_1$list_results_cv[[1]])),sep = ' ; ')
```
The first sub-element precises which prediction method was called:
```{r}
cat(paste(rescv0_1 $list_results_cv[[1]]$prediction_method))
```
For example, to retrieve the data.frame containing predictions for the test set, we extract the **predictions_df** subelement (predictions in column named ".pred"):
```{r}
predictions_year_2010 <- rescv0_1 $list_results_cv[[1]]$predictions_df
predictions_year_2011 <- rescv0_1 $list_results_cv[[2]]$predictions_df
predictions_year_2012 <- rescv0_1 $list_results_cv[[3]]$predictions_df
```
To directly look at the Pearson correlations between predicted and observed values for each environment, we extract the **cor_pred_obs**.
```{r}
cor_year_2010 <- rescv0_1 $list_results_cv[[1]]$cor_pred_obs
cor_year_2011 <- rescv0_1 $list_results_cv[[2]]$cor_pred_obs
cor_year_2012 <- rescv0_1 $list_results_cv[[3]]$cor_pred_obs
head(cor_year_2010)
head(cor_year_2011)
head(cor_year_2012)
```
To get the root mean square error between predicted and observed values for each environment, we extract the **rmse_pred_obs**.
```{r}
rmse_year_2010 <- rescv0_1 $list_results_cv[[1]]$rmse_pred_obs
rmse_year_2011 <- rescv0_1 $list_results_cv[[2]]$rmse_pred_obs
rmse_year_2012 <- rescv0_1 $list_results_cv[[3]]$rmse_pred_obs
head(rmse_year_2010)
head(rmse_year_2011)
head(rmse_year_2012)
```
For a stacked prediction model, one can have a look at the coefficients assigned by the LASSO metalearner to the predictions obtained from the single base learners (here, support vector machine regression models) which respectively use genomic information:
```{r}
head(rescv0_1 $list_results_cv[[3]]$parameters_collection_G)
```
... and based on environmental information:
```{r}
head(rescv0_1 $list_results_cv[[3]]$parameters_collection_E)
```
### seed_used
To get the seed used for building the train/test partitions (especially useful for CV1 and CV2):
```{r}
cat(rescv0_1 $seed_used)
```
### cv_type
To get the type of CV used in the evaluation by `predict_trait_MET_cv()`:
```{r}
cat(rescv0_1 $cv_type)
```
# Step 3: Comes soon!
# Step 4: Comes soon!
# References