Unified Validation Scheme #378
@martinwimpff I recently had this conversation with @robintibor. To me, yes, we should offer and recommend this possibility. It's the textbook / correct way to teach people. Now, for BCI datasets, following what @robintibor had told me, we used a fixed number of epochs with @cedricrommel and @jpaillard, and the models barely overfit even when we train for too many epochs. So at least for this data it seems "ok" to have no validation set for early stopping.
@martinwimpff, does the answer satisfy your question?
@bruAristimunha: Yes, it does answer my question. But we should discuss how to implement that correctly (with skorch?), as there are multiple solutions to the problem.
We can build a tutorial or a page to illustrate these different validation processes. For example, I think the standard presentation in braindecode consists of only a train/validation split for BCI, while another tutorial presents a cross-validation scheme, like the recent one, Hyperparameter tuning with scikit-learn. The significant point is that this is a methodological choice, so it would be nice to explain the options to developers who use braindecode. Do you have availability for this? A page that centralizes the process requires some images and links to tutorials; in comparison, a tutorial consists of a notebook with details.
In my opinion it isn't a real choice. If you do not use an extra validation set while training your DL model, the test set performance is not as meaningful as it should be, so I consider using only a train-test split a methodological error. The significance of this error depends on the rest of the framework. In my opinion there are three ways to do this:
Please let me know what you think about that.
I think we're in agreement, @martinwimpff. I like all three options, but for the BCI case we may have to separate a part of what we currently call the validation set for validation/calibration and keep another part for testing. I think that this way we can avoid the "data leak". For the next step, we can build a PR with a simple page on "validating your model"; that way other people can contribute, and we can fix other tutorials in the process. Something along the lines of a very simplified version of the scikit-learn cross-validation guide: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation To create this page, you need to follow the contribution guide and write a .rst page. Example in braindecode: https://github.com/braindecode/braindecode/blob/master/docs/help.rst Contribution guide: https://github.com/braindecode/braindecode/blob/master/CONTRIBUTING.md
I think the best option would be not to define a fixed validation set, as a fixed split would make a real cross-validation (with different train-val splits per fold) impossible. I would start with option 1, as there is already some similar code. Should this whole validation scheme be more of a (standalone) tutorial, or should it be added as a separate module with a tutorial?
We can start with a tutorial and write a module together. It's a very important subject :) What do you think, @agramfort?
Alright! :) There usually is a separate test session (usually recorded on a different day; for the BCIC dataset this would be session_E). This test session should always be held out completely to avoid any kind of data leakage. This is different from many other domains like vision, NLP, etc., where the whole splitting procedure is more straightforward.
I know that I am repeating myself here, but I really think this is an important topic for making BCI research more reproducible and comparable. In my opinion the community could really benefit if we could agree on one method here.
I agree on the importance of the topic; it is a crucial part of comparing models but is usually underestimated and done the wrong way. Interesting paper on the topic: https://hal.archives-ouvertes.fr/hal-01332785/file/paper.pdf I think that going with option 2 or 3 sounds reasonable. The 3rd option is less heavy and thus probably preferable in the case of a tutorial; no strong opinion between 2 and 3. On the other hand, I would get back to splitting the data into folds. I would consider k-fold cross-validation, but not with random assignment of observations to folds; instead, keep the temporal order of the samples (blockwise CV). In the case of 5-fold CV, this means the first 20% of the signal goes to the 1st fold, the next 20% to the 2nd fold, and so on. With totally random CV I usually got much higher performance due to correlation between samples, which does not correspond to the accuracy on an unseen dataset recorded later.
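For concreteness, here is a minimal sketch of such a blockwise split using scikit-learn's KFold with shuffle=False, assuming the trials are stored in chronological order; the data shapes and names are placeholders, not anything defined in braindecode:

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder data: 500 trials in chronological order, 22 channels, 1000 samples each.
rng = np.random.RandomState(42)
X = rng.randn(500, 22, 1000).astype(np.float32)
y = rng.randint(0, 4, size=500)

# shuffle=False keeps the temporal order, so every validation fold is a
# contiguous block: fold 1 = first 20% of the trials, fold 2 = next 20%, etc.
cv = KFold(n_splits=5, shuffle=False)
for fold, (train_idx, valid_idx) in enumerate(cv.split(X)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_valid, y_valid = X[valid_idx], y[valid_idx]
    # ... train the model on (X_train, y_train), evaluate on (X_valid, y_valid) ...
    print(f"fold {fold}: validation block = trials {valid_idx[0]}..{valid_idx[-1]}")
```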
Thanks for this interesting discussion. As Alex wrote, I consider using the validation set for early stopping as probably unnecessary for most current deep learning models. I agree that one needs some form of validation data, different from the (final) evaluation data, to tune hyperparameters. I think a blockwise k-fold split keeping chronologically neighbouring data in the same fold is potentially the most generic way to do it. We sometimes used only a single validation fold at the end of the training data for two reasons: 1) having a validation set where we also predict only future data, and 2) to speed things up for large datasets, where cross-validation may take a long time. But yeah, as the basic example for braindecode on BCIC IV 2a, cross-validation could be fine for me as well :)
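As an illustration of that single chronological validation fold, here is a minimal sketch with skorch, assuming the trials are in chronological order; the toy network and data are placeholders, not braindecode's API. skorch's predefined_split makes the net validate on exactly the held-out block instead of its default random split:

```python
import numpy as np
from torch import nn
from skorch import NeuralNetClassifier
from skorch.dataset import Dataset
from skorch.helper import predefined_split

# Placeholder data: flattened trials in chronological order.
rng = np.random.RandomState(0)
X = rng.randn(500, 22 * 1000).astype(np.float32)
y = rng.randint(0, 4, size=500).astype(np.int64)

# Use the last 20% of the trials (the "future" data) as the validation fold.
n_valid = len(X) // 5
X_train, y_train = X[:-n_valid], y[:-n_valid]
X_valid, y_valid = X[-n_valid:], y[-n_valid:]

# Placeholder network; any braindecode model could be plugged in here instead.
module = nn.Sequential(nn.Linear(22 * 1000, 64), nn.ReLU(), nn.Linear(64, 4))

net = NeuralNetClassifier(
    module,
    criterion=nn.CrossEntropyLoss,
    max_epochs=20,
    lr=1e-3,
    # Validate on exactly this chronologically last block, not a random split.
    train_split=predefined_split(Dataset(X_valid, y_valid)),
)
net.fit(X_train, y_train)
```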
Thanks for the valuable input! There should be a clear separation between "normal" train/test and HP tuning. As @agramfort pointed out, early stopping / the number of epochs is not a big issue, so we should just use a fixed number of epochs to keep everything simple. So for "normal" training / the final evaluation:
For HP tuning:
The best HP configuration can then be evaluated with the train/test procedure above. These are options 1. and 3. from above, but split into 3 separate procedures.
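For the "normal" training / final evaluation part, a minimal sketch with skorch of what a fixed-epoch train/test procedure could look like; the network, the number of epochs, and the session_T/session_E placeholder arrays are illustrative assumptions, not a fixed braindecode API:

```python
import numpy as np
from torch import nn
from skorch import NeuralNetClassifier
from sklearn.metrics import accuracy_score

# Placeholder arrays standing in for the preprocessed trials of the training
# session (session_T) and the held-out test session (session_E).
rng = np.random.RandomState(0)
X_session_T = rng.randn(288, 22 * 1000).astype(np.float32)
y_session_T = rng.randint(0, 4, size=288).astype(np.int64)
X_session_E = rng.randn(288, 22 * 1000).astype(np.float32)
y_session_E = rng.randint(0, 4, size=288).astype(np.int64)

# Placeholder network; in practice this would be a braindecode model.
module = nn.Sequential(nn.Linear(22 * 1000, 64), nn.ReLU(), nn.Linear(64, 4))

net = NeuralNetClassifier(
    module,
    criterion=nn.CrossEntropyLoss,
    max_epochs=100,     # fixed number of epochs, no early stopping
    lr=1e-3,
    train_split=None,   # no internal validation split at all
)
net.fit(X_session_T, y_session_T)

# session_E is touched exactly once, for the final evaluation.
test_acc = accuracy_score(y_session_E, net.predict(X_session_E))
print(f"test accuracy on session_E: {test_acc:.3f}")
```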
Hi @martinwimpff,
In my opinion this package still needs a unified way to evaluate DL models.
Background
As everyone knows, there are usually 3 different sets: training, validation, testing.
One trains on the training set, validates and tunes the model (i.e., the hyperparameters) on the validation set, and finally evaluates on the unseen test set.
As it is arguably the most popular BCI dataset, I will use the BCI Competition IV 2a dataset for my examples. As a starting point I want to discuss within-subject validation.
Regarding the BCI IV 2a dataset, the train-test split is quite obvious: session_T for training and session_E for testing. The problems come with the validation set.
Examples
1. MOABB: Here the dataset is split into session_T and session_E, and the classifier is trained on session_T and validated on session_E. There exists no test set (the validation set is the test set). This will lead to better results, as the "test set" is used to tune the model (hyperparameters, early stopping, etc.). The final model therefore benefits from the test data during training, and the final "test" result is positively biased. Furthermore, there is a 2-fold cross-validation (the same training is repeated with the sessions interchanged).
2. braindecode example: same as in 1., but without the 2-fold cross-validation.
3. Schirrmeister et al. 2017, Appendix: split the data into train (session_T) and test (session_E). Two training phases:
a. The train set is split into a train and a validation set (probably a random split?). The model is trained on the train split and the hyperparameters are tuned via the validation set. The final training loss of the best model is saved.
b. The best model is then trained on train and validation set combined (the complete session_T) until the training loss reaches the previously saved best training loss, to prevent overfitting (see the sketch after this list). All results are then obtained on the unseen test set (session_E).
"Conventional" approach: same as in 3. but without the second training phase.
Opinion
In my opinion, either method 3 or 4 should be used to get reasonable results. The braindecode/BCI community would really benefit from a unified implementation of one (or both) of these methods. As method 3 adds additional complexity compared to method 4, it would be interesting to know how big the performance boost of method 3 over method 4 is (@robintibor: is it worth it?).
What is your opinion on this topic?