Add possibility to do cross validation split only for train and dev #507
Conversation
@johann-petrak what do you think about this, since you implemented the first version?
We are open to changing xval in general, but the implementation as it is now is conceptually correct. I need to understand better what you are doing in this PR. So you want to keep the test set fixed, train models on different splits of the data, and always evaluate on test?
@PhilipMay could you please explain again what you think is a bug? The very purpose of crossvalidation (like hold-out validation, leave-one-out validation, or the bootstrap estimate) is to estimate the performance of a model trained on all of the available data (and the estimation error) as unbiasedly and consistently as possible. The method for crossvalidation is to split the data into k parts, where each part in turn becomes the validation set for evaluating the model trained on the other k-1 parts. Do you disagree with that description of how crossvalidation is supposed to work, or does the current implementation not conform to that description?
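To make the description above concrete, here is a minimal sketch of plain k-fold crossvalidation as a performance estimator, using sklearn; the dataset and model are placeholders (assumptions), not part of FARM:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Placeholder data standing in for "all available labeled examples".
X, y = make_classification(n_samples=200, random_state=0)

scores = []
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Each of the k parts serves once as the validation set
    # for a model trained on the other k-1 parts.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

# The result of xval is not a model but an estimate (mean and its spread).
print(np.mean(scores), np.std(scores))
```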
Ok - this is a difficult topic to discuss via text chat, but I will try. This is how I want to do cross validation, and how it is shown by sklearn and linked by @Timoeller. My PR makes this possible, but the current implementation makes it impossible.

I want to merge all data (except the test set) and then split that merged data into train and validation (dev). This is done in every fold, as shown in the graphic below. I train a model on train and evaluate it against validation (dev) in each fold. Quoting sklearn: "The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop." Then the final evaluation is against the test set. As sklearn says: _"A test set should still be held out for final evaluation."_ See https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-evaluating-estimator-performance

But the current implementation in FARM works like this (see picture below): it merges all data (including the test set). Then it does not split into train and validation (dev) but into train and test (which is not how sklearn says it is done). Then the train part is split into train and validation (dev). This way you get 5 different test sets, which is definitely not what I want. I have one test set and want it to stay as it is.

I really think this discussion is very difficult via text only. I suggest we set up a video conference with the three of us (@Timoeller and @johann-petrak) if you agree. I would then summarize our conclusion here on GitHub. What do you think?
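The scheme described above can be sketched as follows; the dataset, model, and split sizes are placeholder assumptions, and this stands outside FARM entirely:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=250, random_state=0)

# One fixed test set, held out once and never touched during CV.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

dev_scores = []
for train_idx, dev_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_rest):
    # Per fold: only the remaining data is split into train and dev.
    model = LogisticRegression(max_iter=1000).fit(X_rest[train_idx], y_rest[train_idx])
    dev_scores.append(model.score(X_rest[dev_idx], y_rest[dev_idx]))

# The CV estimate is the average of the per-fold dev scores ...
cv_estimate = float(np.mean(dev_scores))

# ... and the final evaluation happens exactly once, on the fixed test set.
final_model = LogisticRegression(max_iter=1000).fit(X_rest, y_rest)
test_score = final_model.score(X_test, y_test)
```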
Wow, thanks for the detailed graphical explanation. To begin with, I think we both have the correct understanding of how CV should work. To get to scheme 1 with current FARM you can set test_filename=None. That way you can decide what goes into the CV, because many datasets already contain a train/dev/test split (plus potentially a hidden test set), and you might want to use all 3 dataset splits when doing CV.
@PhilipMay I think I understand what you are suggesting. Again, crossvalidation is a statistical process to estimate a performance metric for a model that we could train on all the data used for the crossvalidation. What is confusing here is that the naming of which parts of the data to hold back as "unseen", and when, is not always clear. But none of this changes how crossvalidation works as an estimator: take a set of labeled examples and estimate a performance metric for an ML algorithm on them. Those labeled examples could come from your "pre-defined" training set only, from a combination of the training and dev sets, or from everything you have, depending on which phase of model development you are in.

The DataSiloForCrossVal.make method already lets you choose just a subset of the "pre-defined" train/dev/test sets to use for the xval. So you could use just train and dev to get the estimate, then train a model on train/dev and evaluate on the test set, which would then have been ignored for the estimate. So I guess the confusion arises from the fact that FARM expects/supports data that is already divided into pre-defined train/dev/test sets, but those divisions are usually only used in the final step and in the context of paper writing. Algorithm selection, hyperparameter optimization, architecture search and final deployment need to be much more flexible in how they manage data into "pure unseen" and "used-up" parts anyway.
Ok - maybe we should talk about what I need (want) to do and not about what is wrong or right.

UseCase 1: I have two datasets, a big one and a test set. I want to use FARM to split my big dataset into train and dev. Train is used for training, and dev for EarlyStopping and to evaluate the model in each fold. Because CV is implemented as it is now in FARM (described several times above), I cannot do this. I do not want to split into 3 datasets but only into 2 (train and dev). The current implementation does not support that.

2nd UseCase: In the early stage of a project, where I only have a small number of hand-labeled examples, I only want to work with train and dev splits. This usecase is also not covered by the current implementation in FARM.
That is not right. If I do that, the data I give to …
@PhilipMay if I understand you correctly, you want to use a single set as both dev and validation set for each fold? I assume you mean that for each split one would create a training set and a validation set, and then use the validation set both as dev set and for estimation (which already happens for the early stopping decision, so no additional validation step would actually be needed)? This is definitely not a proper way to do performance estimation, since the result will be biased, and I think it should not be called a "crossvalidation estimate". I can see that it might be tempting to try this in practice if you are very short on data, but I am not sure it should be part of the library as such.
Hmm - somehow we really seem to have a huge gap in our understanding here. For me, dev and validation set are absolutely the same; they are synonyms to me. And these 2 references seem to have the same understanding:
and here
@johann-petrak So you say I need 4 sets, right? Train, test, dev and validation? I never heard about that in the literature. Could you maybe provide a link?
And you say that when you do early stopping and evaluation on the same set, you overfit to that set in each fold and so do not get representative results? Is that what you mean?
@johann-petrak in case you want to do early stopping and validation on different datasets, wouldn't it be better to do a nested cross validation? Especially if you have a small dataset?
@johann-petrak and @Timoeller since I agree that it is dangerous to do evaluation and early stopping on the same dataset, I added the possibility to do nested cross validation. Can you please have a look at PR #508? Thanks!
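For readers unfamiliar with the term: nested cross validation wraps one CV loop inside another, so that no single set is reused for both model selection and evaluation. A minimal sklearn sketch (dataset, model, and parameter grid are placeholder assumptions, independent of the FARM implementation in #508):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# Inner loop: selects a hyperparameter via 3-fold CV on the outer training data.
inner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=3,
)

# Outer loop: estimates the performance of the whole selection procedure,
# so the outer validation folds are never seen during selection.
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```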
There is no universal definition of terms here, unfortunately. Remember that many classical algorithms do not need a dev set at all, and that historically, especially in statistics, there was no practice of handing around datasets with a fixed "test" or even "dev" set at all. It helps to point out which model we are talking about: the result of xval is NOT a model but an estimate, because in order to get that estimate, k models have been trained and evaluated. To train and evaluate each of these models we use the same approach we would use if we just did this once: train on the training set and evaluate on the per-fold "test" set, which in the context of xval is usually called the evaluation or validation set.

When we do architecture search, hyperparameter search and all that, things become more complicated, and we may indeed need to do nested estimation. I just think that the scenarios for where and how you need to do nested evaluations are so diverse that it would be good to think a bit more about the exact way to do that, and especially to combine it with 1) other estimation strategies like hold-out estimates and 2) some way to also define the strategy for how to use those estimates, e.g. through greedy or beam-search hyperparameter optimization etc.

I am just a random outside contributor to FARM myself, so I will leave all that, and also the decision of whether or how to use the suggested PR, to the FARM developers.
@johann-petrak thanks for your feedback again. Now I understand why you do cross validation the way you do.

Another thought: FARM forces the user to load its data from disk. There is no way to just pass different datasets (numpy or pandas) to it. If that were possible, the user could decide which kind of cross validation to do: leave-one-out, nested, all the fancy stuff. If FARM opened up to accept datasets passed in directly, we would not be forced to do crossvalidation inside FARM but could code it around / outside of FARM.
I wish we had more "random outside contributors" like you, @johann-petrak. We took the discussion offline (in a remote call, of course) and are aligned on the type of CV Johann implemented and its correctness. We leave the code as is. As mentioned by Philip, we need a way to handle the (sometimes) strong variance that comes from multiple model restarts. Philip is implementing a solution in #508. Closing this PR now, though I hope future generations will be able to learn something from the discussion recorded here : )
This adds the possibility to do cross validation splits only for the train and dev datasets and to keep the test set as it is (if specified). This is done by adding just one additional parameter called `train_dev_split_only`, which is `False` by default. The way this has been implemented does not introduce any breaking changes.

The original implementation (when `train_dev_split_only` is `False`) works like this: the data (including the test set) is merged and split into per-fold train and test sets, and then the `dev_split` value is used to split a dev set away from the train set. This implementation is something I personally never saw, and I would call it a bug. But since this is just my opinion, I simply added this alternative way, because my project needs it.

When `train_dev_split_only` is `True`, this happens: the test set is kept as it is, and only the train and dev data is split into per-fold train and dev sets.

This is how cross validation should work, in my opinion. If I also wanted to include the test set in the crossvalidation, I would use nested cross validation - but that is something different.

Please give me feedback on what you think about this PR. If you agree, I will add a test and we are ready to merge.