In my opinion this package still needs a unified way to evaluate DL models.
Background
As everyone knows, there are usually three different sets: training, validation, and testing.
One trains on the training set, validates and tunes the model (i.e. the hyperparameters) on the validation set, and finally evaluates on the unseen test set.
Since it is arguably the most popular BCI dataset, I will use the BCI competition IV 2a dataset for my examples. As a starting point I want to discuss WithinSubject validation.
For the BCI IV 2a dataset, the train-test split is quite obvious: session_T for training and session_E for testing.
The problems come with the validation set.
Examples
1. MOABB: Here the dataset is split into session_T and session_E, and the classifier is trained on session_T and validated on session_E. There is no separate test set (the validation set is the test set). This leads to better-looking results, because the "test set" is used to tune the model (hyperparameters, early stopping, etc.). The final model therefore benefits from the test data during training, and the final "test" result is positively biased. In addition, there is a 2-fold cross-validation (the same training is repeated with the sessions interchanged).
2. braindecode example: same as in 1., but without the 2-fold cross-validation.
3. Schirrmeister et al. 2017, Appendix: The data is split into train (session_T) and test (session_E). There are 2 training phases:
   a. The train set is split into a train and a validation set (probably a random split?). The model is trained on the train split and the hyperparameters are tuned via the validation set. The final training loss of the best model is saved.
   b. The best model is trained on the train and validation sets together (complete session_T) until the training loss reaches the previous best training loss, to prevent overfitting. All results are then obtained on the unseen test set (session_E).
4. "Conventional" approach: same as in 3., but without the second training phase.
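To make the difference between methods 3 and 4 concrete, here is a minimal sketch in plain Python. It is not the braindecode or MOABB API and not the original implementation from the paper; `split_within_subject`, `train_two_phase`, `fit_epoch` and `loss` are hypothetical names, the random split of session_T is an assumption (see the question mark in 3a), and the stopping rule in phase b simply follows the description given in 3b.

```python
import copy
import random


def split_within_subject(trials, valid_fraction=0.2, seed=0):
    """Split one subject's trials into train / valid / test.

    `trials` is a list of dicts with a "session" key (hypothetical layout).
    session_E is kept untouched as the test set; the validation set is
    carved out of session_T (random split, which is an assumption).
    """
    session_t = [t for t in trials if t["session"] == "session_T"]
    test = [t for t in trials if t["session"] == "session_E"]
    shuffled = session_t[:]
    random.Random(seed).shuffle(shuffled)
    n_valid = int(round(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid], test


def train_two_phase(model, fit_epoch, loss, train, valid, max_epochs=50):
    """Two-phase training as described in method 3 (sketch only).

    `fit_epoch(model, data)` trains the model in place for one epoch and
    `loss(model, data)` returns a float; both are user-supplied callables.
    Stopping after phase a and returning `best_model` would be method 4.
    """
    # Phase a: train on the train split, keep the model with the best
    # validation loss and remember its training loss.
    best_model = copy.deepcopy(model)
    best_valid = float("inf")
    target_train_loss = float("inf")
    for _ in range(max_epochs):
        fit_epoch(model, train)
        valid_loss = loss(model, valid)
        if valid_loss < best_valid:
            best_valid = valid_loss
            best_model = copy.deepcopy(model)
            target_train_loss = loss(model, train)

    # Phase b: continue from the best model on train + valid (the complete
    # session_T) until the training loss reaches the saved value.
    model = best_model
    full = train + valid
    for _ in range(max_epochs):
        if loss(model, full) <= target_train_loss:
            break
        fit_epoch(model, full)
    return model
```

In both variants session_E is only touched once, for the final evaluation of the returned model, so hyperparameter tuning and early stopping never see the test data.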
Opinion
In my opinion, either method 3 or 4 should be used to get reasonable results. The braindecode/BCI community would really benefit from a unified implementation of one (or both) of these methods. Since method 3 adds complexity on top of method 4, it would be interesting to know how big the performance boost of method 3 over method 4 is (@robintibor: is it worth it?).
What is your opinion on this topic?