Unified Validation Scheme #378

Closed
martinwimpff opened this issue May 3, 2022 · 13 comments

@martinwimpff
Collaborator

In my opinion this package still needs a unified way to evaluate DL models.

Background
As everyone knows, there are usually 3 different sets: training, validation, testing.
One trains on the training set, validates and tunes the model (i.e. the hyperparameters) on the validation set, and finally evaluates on the unseen test set.

As it is arguably the most popular BCI dataset, I will use the BCI Competition IV 2a dataset for my examples. As a starting point I want to discuss within-subject validation.
For BCI IV 2a the train-test split is quite obvious: session_T for training and session_E for testing.
The problems arise with the validation set.

Examples

  1. MOABB:
    Here the dataset is split into session_T and session_E; the classifier is trained on session_T and validated on session_E. There is no test set (the validation set is the test set). This will lead to better results, as the "test set" is used to tune the model (hyperparameters, early stopping, etc.). The final model therefore benefits from the test data during training, and the final "test" result is positively biased. In addition, a 2-fold cross-validation is performed (the same training is repeated with the sessions interchanged).

  2. braindecode Example: same as in 1. without the 2-fold cross-validation.

  3. Schirrmeister et al. 2017 Appendix: Split the data into train (session_T) and test (session_E). There are 2 training phases (see the sketch after this list):
    a. The train set is split into a train and a validation set (probably a random split?). The model is trained on the train split and the hyperparameters are tuned via the validation set. The final training loss of the best model is saved.
    b. The best model is trained on train and validation set together (the complete session_T) until the training loss reaches the previous best training loss, to prevent overfitting. All results are then obtained on the unseen test set (session_E).

  4. "Conventional" approach: same as in 3. but without the second training phase.

Opinion
In my opinion either method 3 or 4 should be used to get reasonable results. The braindecode/BCI community would really benefit from a unified implementation of one (or both) of these methods. As method 3 adds complexity on top of method 4, it would be interesting to know how big the performance boost of method 3 over method 4 is (@robintibor: is it worth it?).

What is your opinion on this topic?

@agramfort
Collaborator

@martinwimpff I recently had this conversation with @robintibor. To me, yes, we should offer and recommend this possibility; it's the textbook/correct way to teach people. For the BCI dataset, following what @robintibor had told me, we used a fixed number of epochs with @cedricrommel and @jpaillard, and it barely overfits even when we take too many epochs. So at least for this data it seems "ok" to have no validation set for early stopping the training.

Cc @cedricrommel @jpaillard

@bruAristimunha
Collaborator

@martinwimpff, does the answer satisfy your question?

@bruAristimunha bruAristimunha added the question Further information is requested label Jul 23, 2022
@martinwimpff
Collaborator Author

@bruAristimunha: Yes it does answer my question. But we should discuss how to implement that correctly (with skorch?) as there are multiple solutions to the problem.

@bruAristimunha
Collaborator

bruAristimunha commented Jul 24, 2022

We can build a tutorial or a page to illustrate how to do these different validation processes. For example, I think the standard presentation in braindecode consists of only train/validation for BCI, while another tutorial presents the cross-validation scheme, like the recent one, Hyperparameter tuning with scikit-learn.

The significant point is that this is a methodological choice, so it would be nice to explain the options to developers who use braindecode. Do you have availability for this? A page that centralizes the process requires some images and links to tutorials, whereas a tutorial consists of a notebook with details.

@martinwimpff
Collaborator Author

In my opinion it isn't a real choice. If you do not use an extra validation set when training your DL model, the test set performance is not as meaningful as it should be. So I consider using only a train-test split a methodological error. The significance of this error depends on the rest of the framework.

In my opinion there are three ways to do this:

  1. Using your linked method, but without different hyperparameters and with more splits (i.e. 5-fold to get an 80-20 split). What should be mentioned is that one would have to take the average test score over all folds (not only the best one).
  2. The "simple" skorch way. Either pass the validation set with the keyword train_split or let skorch do the splitting (of the training data). Use the test set afterwards to score the result. Pro: ease of use. Contra: only one fold instead of all five (the significance of this depends on the dataset/dataset size). A sketch of this option is shown below, after the code for option 3.
  3. Extended skorch evaluation. This would be the most complex way to do it, but also the most extensible one, as one is able to add separate logging, evaluation, confusion_matrix etc.
    This would look something like this:
from sklearn.base import clone
from sklearn.model_selection import KFold
from skorch.helper import predefined_split
from torch.utils.data import Subset

# clf is a braindecode EEGClassifier, train_dataset the training data (session_T),
# X_test/y_test the held-out test data (session_E)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold_idx, (train_idx, val_idx) in enumerate(kfold.split(train_dataset)):
    # divide train dataset
    train_subset = Subset(train_dataset, train_idx)
    val_subset = Subset(train_dataset, val_idx)
    clf_clone = clone(clf)
    clf_clone.set_params(train_split=predefined_split(val_subset))
    # train model
    clf_clone.fit(train_subset, y=None)
    # log/investigate/access history
    hist = clf_clone.history
    # score test data
    test_acc = clf_clone.score(X_test, y=y_test)
    # delete clone
    del clf_clone
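
For option 2, a minimal sketch (reusing the clf / train_dataset / X_test / y_test names assumed above, with an unshuffled 80-20 split as one possible choice) could look like this:

from skorch.helper import predefined_split
from torch.utils.data import Subset

# fixed, unshuffled 80-20 split of the training data (session_T)
n_train = int(0.8 * len(train_dataset))
train_subset = Subset(train_dataset, range(n_train))
val_subset = Subset(train_dataset, range(n_train, len(train_dataset)))

# validate on the fixed subset during training ...
clf.set_params(train_split=predefined_split(val_subset))
clf.fit(train_subset, y=None)

# ... and score once on the untouched test data (session_E)
test_acc = clf.score(X_test, y=y_test)

Letting skorch do the splitting itself (the other variant mentioned in option 2) would just mean passing a splitting helper such as skorch's ValidSplit (CVSplit in older versions) as train_split instead.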

Please let me know what you think about that.

@bruAristimunha
Collaborator

I think we're in agreement, @martinwimpff. I like all three options, but we may have to separate part of what we now call the validation set for the BCI into a validation/calibration part and another part for testing. I think that this way we may avoid the "data leak".

For the next step, we can build a PR with a simple page for "validation of your model". In this way, other people can contribute. We can fix other tutorials in the process.

Something along the lines of a very simplified version of the scikit-learn cross-validation guide: https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation

To create this page, you need to follow the contribution guide and write an .rst page. Example in braindecode: https://github.com/braindecode/braindecode/blob/master/docs/help.rst
Example in MNE: https://github.com/mne-tools/mne-python/blob/main/doc/overview/cookbook.rst

Contribution guide: https://github.com/braindecode/braindecode/blob/master/CONTRIBUTING.md

@martinwimpff
Collaborator Author

I think the best option would be not to define a fixed validation set, as that would make a real cross-validation (with different train-val splits per fold) impossible.
So here is my solution (for the BCI Competition dataset):
The data from session_T is used for training and validation. I would suggest a 5-fold cross-validation, i.e. an 80-20 split. I would not pay attention to the different runs (each session consists of 6 runs of 48 trials each), as this adds unnecessary complexity to the CV. The exact splitting (which samples belong to train/val during which fold) can be controlled either via the seed or via setting shuffle=False. However, this exact splitting should not have a significant impact on the performance (otherwise the whole procedure would be a failure).
The test set is session_E. This part remains untouched during the entire training/tuning phase.

I would start with option 1, as there is already some similar code.
It would be beneficial to define a fixed validation set for option 2 (maybe that's what you meant?). I would suggest an 80-20 split of session_T without shuffling. This second option should also be included, as it is the fastest one with as little overhead as possible.

Should this whole validation scheme be more of a (standalone) tutorial or should it be added as a separate module with a tutorial?

@bruAristimunha
Collaborator

We can start with a tutorial and write a module together. It's a very important subject :) What do you think @agramfort?

@martinwimpff
Collaborator Author

Alright! :)
One last thing: after an internal discussion with a colleague (about CV), I would like to have your opinion on the "best" way to do cross-validation with BCI data. Depending on the data, the application domain and the community, there are different ways to do CV. The sklearn page also lists different methods.

There usually is a separate test session (usually recorded on a different day; for the BCIC dataset this would be session_E), and this test session should always be held out completely to avoid any kind of data leak. This is different from many other domains like vision, NLP, etc., where the whole splitting procedure is more straightforward.
Now there are four ways to tune the model:

  1. Split the training data into train-val (i.e. 80-20) once. Train the model once and test on the test data once. Pro: simplicity. Contra: the tuning (heavily) depends on the train-val split.
  2. Do a k-fold cross-validation where you split the training data into k equal folds and train the model k times on k-1 folds with the remaining fold as validation set. Use the average validation performance to find the best parameters. Train k models with these parameters, evaluate each model on the test set (k values) and use either (2a) the average or (2b) the best score. Pro: probably the "most correct" way to do CV; should greatly reduce the dependency on the train-val split. Contra: evaluating all k models on the test set seems like overkill and a bit unnecessary. The validation splits are not "fully used", which is a little inefficient and might lead to worse performance due to the typically small datasets.
  3. Do a "Schirrmeister" k-fold cross-validation where you split the training data into k equal folds and train the model k times on k-1 folds with the remaining fold as validation set. Use the average validation performance to find the best parameters. Then use these parameters and train a model with all k folds as the training set. Evaluate this model on the test dataset (see the sketch after this list). Pro: best use of the available data. Contra: not textbook-like :)
  4. Hybrid solution: do a k-fold CV like in 2. to find the parameters. Then train one model on k-1 folds with these parameters (like in 1.). Evaluate this model on the test data. Basically the same as 3. but without using all k folds for the final training. This option is only mentioned for completeness; I do not think it should be used, as it is also not textbook-like and has no pros over method 3.
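
Option 3 maps quite naturally onto scikit-learn's GridSearchCV with refit=True, since skorch/braindecode classifiers are scikit-learn compatible: the parameters are selected via CV on the training data and the best configuration is then refit on all of it. A rough sketch, assuming a classifier clf and arrays X_train/y_train (session_T) and X_test/y_test (session_E); the parameter grid is only a placeholder:

from sklearn.model_selection import GridSearchCV, KFold

param_grid = {"lr": [1e-3, 1e-4]}  # example grid, not a recommendation
search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=False),  # blockwise folds, temporal order kept
    refit=True,  # retrain the best configuration on the full training data
)
search.fit(X_train, y_train)
test_acc = search.score(X_test, y_test)  # single evaluation on session_E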

I know that I am repeating myself here, but I think this is a really important topic for making BCI research more reproducible and comparable. In my opinion the community could really benefit if we could agree on one method here.
Let me know what you think about this.

@sliwy
Collaborator

sliwy commented Jul 25, 2022

I agree on the importance of the topic; it is a crucial part of comparing models, but it is usually underestimated and done the wrong way. Interesting paper on the topic: https://hal.archives-ouvertes.fr/hal-01332785/file/paper.pdf

I think that going with option 2 or 3 sounds reasonable. The 3rd option is less heavy and thus probably preferable in the case of a tutorial. No strong opinion between 2 and 3.

On the other hand, I would come back to how the data is split into folds. I would consider k-fold cross-validation, but not with random assignment of observations to folds; instead, keep the temporal order of the samples (blockwise CV). In the case of 5-fold CV, this means the first 20% of the signal goes to the 1st fold, the second 20% to the 2nd fold, and so on. With totally random CV I usually got much higher performance, due to correlation between samples, which does not correspond to the accuracy on an unseen dataset recorded later.
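
As a minimal illustration of the blockwise splitting: scikit-learn's KFold with shuffle=False already keeps chronologically neighbouring samples in the same fold (TimeSeriesSplit would be an option if the training data should additionally always precede the validation data):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # toy stand-in for 10 chronologically ordered trials
for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5, shuffle=False).split(X)):
    print(f"fold {fold}: validation block = {val_idx}")  # contiguous 20% blocks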

@bruAristimunha bruAristimunha added enhancement New feature or request documentation Improvements or additions to documentation labels Jul 25, 2022
@robintibor
Contributor

Thanks for this interesting discussion. As Alex wrote, I consider using the validation set for early stopping as probably unnecessary for most current deep learning models. I agree that one needs some form of validation data different from the (final) evaluation data to tune hyperparameters. I think a blockwise k-fold splitting keeping chronologically neighbouring data in the same fold is potentially the most generic way to do it.

We sometimes used only a single validation fold at the end of training for two reasons: 1) to have a validation set on which we also predict only future data, and 2) to speed things up for large datasets, where cross-validation may take a long time.

But yeah, as the basic example for braindecode on BCIC IV 2a, cross-validation could be fine for me as well :)

@martinwimpff
Collaborator Author

Thanks for the valuable input!
So to wrap this up (and to make it as simple & fast as possible):

There should be a clear separation between "normal" train_test and HP tuning. As @agramfort pointed out, early stopping/the number of epochs is not a big issue, so we should just use a fixed number of epochs to keep everything simple.

So for "normal" training/for the final evaluation:

  • train_test: simple division into session_T (train) and session_E (test), no validation set. This procedure should only be used for the final evaluation.

For HP tuning:

  • k-fold cross-validation: use only session_T, split it into k folds (not shuffled in time), k-1 folds for training and 1 fold for validation. Search (grid, random, Bayes, whatever) for the best HP configuration using the average score over the k folds.
  • fast HP tuning: same as above, but just use the first split of the k-fold CV to speed up the tuning process. This method should be preferred over the full k-fold CV if either a) the training duration is long or b) the HP search space is very large (e.g. preliminary experiments). A sketch follows below.

The best HP configuration can then be evaluated by the train_test procedure above.
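
For the fast HP tuning variant, the same search can simply be given a single predefined train/validation split instead of the full k-fold CV. A rough sketch with the same assumed names as above (clf, param_grid, X_train/y_train from session_T); here the last 20% of session_T serves as validation so that only "future" data is predicted, but the literal first fold of the unshuffled k-fold would work the same way:

import numpy as np
from sklearn.model_selection import GridSearchCV

n = len(X_train)
split = int(0.8 * n)
single_fold = [(np.arange(split), np.arange(split, n))]  # one (train, val) split

search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=single_fold,  # a single predefined split instead of k folds
    refit=False,     # only search here; the final model is trained via train_test above
)
search.fit(X_train, y_train)
print(search.best_params_)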

These are options 1. and 3. from above, but split into 3 separate procedures.
@bruAristimunha: do you agree?

@bruAristimunha
Collaborator

Hi @martinwimpff,
I agree! If you need any help getting started with PR, please let me know!
