
training set, validation set, and test set #475

Closed
chenyongpeng1 opened this issue Jun 11, 2024 · 3 comments
chenyongpeng1 commented Jun 11, 2024

Hello, biomod2 team!
Species records are randomly divided into a training set (75% of the data) for model calibration and a test set (25% of the data) for validation.

--This is somewhat misleading. Typically there should be three parts: a training set, a validation set, and a test set. The validation set is used to fine-tune the model's hyperparameters, while the test set is used to assess the model's accuracy on unseen data.
Does biomod2 include a validation set? If not, could you add one and use techniques such as cross-validation to mitigate overfitting?

@MayaGueguen
Contributor

Hello Chenyong 👋

Actually, we do have calibration, validation and evaluation datasets 🙂

  • calibration and validation both come from your original dataset
  • evaluation is a completely separate dataset that, if provided, is only used to compute evaluation metrics for the models built on the original dataset

Datasets are given to the BIOMOD_FormatingData function:

  • the original dataset is given through the resp.var, expl.var and resp.xy parameters
  • the evaluation dataset is provided through the eval.resp.var, eval.expl.var and eval.resp.xy parameters
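For example, a call passing both datasets could look like the sketch below (object names such as my_presences or eval_env are placeholders, and resp.name is the species label used for the outputs; please check the BIOMOD_FormatingData help page for the full argument list):

```r
# Hedged sketch: providing both the original and an independent evaluation
# dataset to BIOMOD_FormatingData. All data objects are placeholders.
library(biomod2)

bm_data <- BIOMOD_FormatingData(
  resp.name = "MySpecies",
  resp.var  = my_presences,       # original response (e.g. presence/absence)
  expl.var  = my_env,             # original explanatory variables
  resp.xy   = my_xy,              # coordinates of the original records
  eval.resp.var = eval_presences, # independent evaluation response
  eval.expl.var = eval_env,       # evaluation explanatory variables
  eval.resp.xy  = eval_xy         # coordinates of the evaluation records
)
```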

The split of the original dataset into calibration and validation is done within the BIOMOD_Modeling function through the CV.[...] parameters. (This is done by calling the bm_CrossValidation function, which you can also use on your own if necessary.)

ℹ️ Note that there are several ways to build your calibration and validation datasets. You can find more details within the Cross-validation vignette.
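As an illustration, a random cross-validation setup could be requested like this (the values are only examples, and the model list is arbitrary; see the BIOMOD_Modeling help page for all CV.[...] parameters):

```r
# Hedged sketch: splitting the original dataset into calibration and
# validation via the CV.[...] parameters of BIOMOD_Modeling.
bm_models <- BIOMOD_Modeling(
  bm.format   = bm_data,      # object returned by BIOMOD_FormatingData
  models      = c("GLM", "RF"),
  CV.strategy = "random",     # cross-validation strategy
  CV.nb.rep   = 3,            # number of cross-validation repetitions
  CV.perc     = 0.75,         # 75% calibration / 25% validation
  CV.do.full.models = FALSE   # do not also fit models on all the data
)
```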

When calling the get_evaluations function to retrieve your evaluation values, you will see 3 columns in your output table: calibration, validation and evaluation. They are only filled if the corresponding dataset was provided: evaluation will be empty if you did not provide an evaluation dataset to BIOMOD_FormatingData, and the validation column will be empty if, for example, you ask to build a model with all the data (CV.do.full.models = TRUE in BIOMOD_Modeling).
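Retrieving and inspecting those three columns could look like this (column selection shown here assumes the column names mentioned above):

```r
# Hedged sketch: the calibration, validation and evaluation columns of the
# get_evaluations output are filled only when the matching dataset exists.
evals <- get_evaluations(bm_models)
head(evals[, c("metric.eval", "calibration", "validation", "evaluation")])
```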

➡️ Note also that we have plenty of tutorials and documentation on our website, and you can get an overview of the package functions in this presentation. So do not hesitate to have a look 👀

Hope it helps,
Maya

@chenyongpeng1
Author

chenyongpeng1 commented Jun 12, 2024

Hello, Maya
Thank you for your patience and attention; I have learned a lot.
But I still have some questions:

  1. Can the original dataset be automatically divided into three parts through the CV function, for example 70:10:20 into calibration, validation, and test (evaluation)? And are the eval.resp.var, eval.expl.var, eval.resp.xy parameters the same as before?
  2. I'm a little confused: biomod2 allows you to use different strategies to separate your data into a calibration dataset and a validation dataset for the cross-validation. But in this picture, which one is the validation set? In other words, is the validation dataset in biomod2 equivalent to the validation fold from the training set, or to the test set in this picture?
    [image: diagram showing a Training Data set split into Training folds and a Validation fold, plus a separate Test Data set]

Looking forward to your reply! Thanks again!

Best wishes,
Chenyongpeng

@MayaGueguen
Contributor

Hello Chenyongpeng,

Can the original dataset be automatically divided into three parts through the CV function, for example 70:10:20 into calibration, validation, and test (evaluation)? And are the eval.resp.var, eval.expl.var, eval.resp.xy parameters the same as before?

No. As the evaluation dataset is supposed to be independent from the data used to build the models, the original data can only be divided into calibration and validation through biomod2.
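If a 70:10:20 style split is really wanted, one workaround (not a built-in biomod2 feature) is to hold out the evaluation part manually before formatting, and let biomod2 split only the remainder; object names below are placeholders:

```r
# Hedged sketch: hold out 20% of the records as an independent evaluation
# set BEFORE calling BIOMOD_FormatingData, then let biomod2 split the
# remaining 80% into calibration and validation.
set.seed(42)
n <- nrow(my_records)
eval_idx <- sample(n, size = round(0.2 * n))  # 20% held out as evaluation

eval_records <- my_records[eval_idx, ]   # pass via eval.resp.var / eval.expl.var / eval.resp.xy
orig_records <- my_records[-eval_idx, ]  # pass via resp.var / expl.var / resp.xy

# In BIOMOD_Modeling, CV.perc then splits the remaining 80%:
# CV.perc = 0.875 gives 0.875 * 0.8 = 70% calibration and 10% validation
# of the full dataset, i.e. roughly a 70:10:20 split overall.
```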

I'm a little confused: biomod2 allows you to use different strategies to separate your data into a calibration dataset and a validation dataset for the cross-validation. But in this picture, which one is the validation set? In other words, is the validation dataset in biomod2 equivalent to the validation fold from the training set, or to the test set in this picture?

In your picture:

  • Training Data set = what I called original data in previous answer
    • Training fold = calibration dataset in biomod2
    • Validation fold = validation dataset in biomod2
  • Test Data set = evaluation dataset in biomod2

Hope it helps,
Maya
