
Add possibility to do cross validation split only for train and dev #507

Closed

Conversation

@PhilipMay (Contributor) commented Aug 31, 2020

This adds the possibility to do cross validation splits only for the train and dev datasets
and to keep the test set as it is (if specified). This is done by adding just one additional parameter called
train_dev_split_only, which is False by default. The way this has been implemented
does not introduce any breaking changes.

The original implementation (when train_dev_split_only is False) works this way (sketched below):

  • concat all datasets given in sets
  • split concatenation into train and test folds (5 by default)
  • then use the dev_split value to split a dev set away from the train set
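
For illustration, a minimal sketch of this behaviour (not the actual FARM code), using sklearn's KFold; all_data and dev_split are placeholder names:

```python
# Sketch of the current behaviour: k-fold over everything, a different test
# fold per iteration, dev split off the train part afterwards.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

all_data = np.arange(100)   # placeholder for the concatenation of all given sets
dev_split = 0.1             # fraction split away from the train part, as described above

for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(all_data):
    test_fold = all_data[test_idx]          # a *different* test set in every fold
    train_fold, dev_fold = train_test_split(
        all_data[train_idx], test_size=dev_split, random_state=42)
```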

This implementation is something I have personally never seen, and I would call it a bug.
But since this is just my opinion, I simply added this alternative way, because my
project needs it this way.

When train_dev_split_only is True, this happens instead (sketched below):

  • concat all datasets given in sets
  • split concatenation into train and dev folds (5 by default)
  • if a test set is specified, return the same test set in all "splits"
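
Again a minimal sketch (not the actual FARM code) of what train_dev_split_only=True is meant to do; fixed_test stands in for the pre-defined test set:

```python
# Sketch of the proposed behaviour: k-fold only over train+dev,
# while the pre-defined test set is passed through unchanged in every fold.
import numpy as np
from sklearn.model_selection import KFold

train_and_dev = np.arange(100)    # placeholder for the concatenated train/dev data
fixed_test = np.arange(100, 120)  # placeholder for the pre-defined test set

for train_idx, dev_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(train_and_dev):
    train_fold = train_and_dev[train_idx]
    dev_fold = train_and_dev[dev_idx]
    test_fold = fixed_test        # the *same* test set in all "splits"
```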

This is how cross validation should work, in my opinion. If I also wanted to include the test
set in the crossvalidation, I would use nested cross validation - but that is something different.

Please give me feedback on what you think about this PR. If you agree, I will add a test and we are ready to merge.

Todo

  • write test

@PhilipMay (Contributor, Author)

@johann-petrak, what do you think about this, since you implemented the first version?

@Timoeller (Contributor)

We are open to changing xval in general, but the implementation as it is now is conceptually correct.

I need to understand better what you are doing in this PR. So you want to keep the test set fixed, train models on different splits of the data, and always evaluate on test?
If that is the case, I would not like to include it, because this is exactly not how I understand crossvalidation. See the first sentence of the sklearn documentation: "Learning the parameters of a prediction function and testing it on the same data is a methodological mistake."

@johann-petrak (Contributor)

@PhilipMay could you please explain again what you think is a bug?

The very purpose of crossvalidation (like hold-out validation, leave-one-out validation, or the bootstrap estimate) is to estimate the performance of a model trained on all of the available data (and the estimation error) in a way that is as unbiased and consistent as possible. The method for crossvalidation is to split the data into k parts, where each part in turn becomes the validation set in one iteration, used to evaluate the model trained on the k-1 other parts.
With methods like neural nets, the training may need a "development set", which has to be split off the k-1 parts and belongs to the "seen data" for training, while the kth part is the "unseen data" used to estimate the performance of the kth model.

Do you disagree with that description of how crossvalidation is supposed to work or does the current implementation not conform to that description?

@PhilipMay (Contributor, Author) commented Aug 31, 2020

OK - this is a difficult topic to discuss via text chat, but I will try.

This is how I want to do cross validation, and how it is shown by sklearn in the page linked by @Timoeller. My PR makes this possible, but the current implementation makes it impossible. I want to merge all data (except the test set) and then split that merged data into train and validation (dev). This is done in every fold, as shown in the graphic below. I train a model on train and evaluate it against validation (dev) in each fold. The evaluation is done with the EarlyStopping class after every evaluate_every steps.

Citation from sklearn: "The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop."

Then the final evaluation is done against the test set.

As sklearn says: "A test set should still be held out for final evaluation." See https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-evaluating-estimator-performance

[image: grafik]

But the current implementation in FARM works like this (see picture below):

It merges all data (including the test set). Then it does not split into train and validation (dev), but into train and test (which is not how sklearn says it is done). Then the train part is split into train and validation (dev). This way you get 5 different test sets, which is definitely not what I want. I have one test set and want it to stay as it is.

[image: grafik]

I really think this discussion is very difficult in text alone. I suggest setting up a video conference with the three of us (@Timoeller and @johann-petrak) if you agree. I would then summarize our conclusion here on GitHub.

What do you think?

@Timoeller (Contributor)

Wow, thanks for the detailed graphical explanation. To begin, I think we both have the correct understanding of how CV should work.

To get to schema 1 with current FARM you can set test_filename=None. That way you can decide what goes into the CV because many datasets already contain a train/dev/test split (+ potentially a hidden test set). You might want to use all 3 dataset splits when doing CV.

@johann-petrak (Contributor)

@PhilipMay I think I understand what you are suggesting.

Again, crossvalidation is a statistical process to estimate a performance metric for a model that we could train on all the data used for the crossvalidation. What is confusing here is that the naming of which parts of the data to hold back as "unseen", and when, is not always clear.
This is made more confusing by the fact that some ML datasets come with "preset" divisions into train/dev/test sets for easier comparison.

But none of this has anything to do with how crossvalidation works as an estimator: take a set of labeled examples and estimate a performance metric for an ML algorithm on them.

Those labeled examples could come from your "pre-defined" training set only, from a combination of the training set and the dev set, or from everything you have, depending on which phase of model development you are in.
In a very early stage they could even come from just a subset of the training set, so that you have enough unseen data for algorithm comparison, hyper-parameter optimization, and whatnot. Right before deployment you probably want to include all the data you have for training the final model, including the test set, and estimate the performance by doing xval on all data as well.

The DataSiloForCrossVal.make method already lets you choose just a subset of the "pre-defined" training/dev/test sets to use for the xval. So you could use just train and dev to get the estimate, then train a model on train/dev and evaluate on the test set which would then have been ignored for the estimate.
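
For illustration, the workflow described here could look roughly like this. This is a sketch only: the sets/n_splits parameters and the import path follow this discussion and may differ in your FARM version, and processor and train_and_eval are hypothetical placeholders.

```python
# Rough sketch of the workflow described above; check your FARM version for
# the exact DataSiloForCrossVal.make signature. train_and_eval() is a
# hypothetical helper standing in for a full per-fold training + evaluation run.
from farm.data_handler.data_silo import DataSilo, DataSiloForCrossVal

data_silo = DataSilo(processor=processor, batch_size=32)  # processor defined elsewhere

# use only the pre-defined train and dev data for the crossvalidation estimate
silos = DataSiloForCrossVal.make(data_silo, sets=["train", "dev"], n_splits=5)
fold_metrics = [train_and_eval(silo) for silo in silos]
xval_estimate = sum(fold_metrics) / len(fold_metrics)

# afterwards: train one final model on train+dev and evaluate it once
# on the test set, which was ignored for the estimate above
```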

So I guess the confusion arises from the fact that FARM expects/supports data that is already divided into pre-defined train/dev/test sets, but those divisions are usually only used in the final step and in the context of paper writing. Algorithm selection, hyperparameter optimization, architecture search and final deployment need to be much more flexible in how data is managed into "pure unseen" and "used-up" parts anyway.

@PhilipMay (Contributor, Author) commented Aug 31, 2020

Ok - maybe we should talk about what I need (want) to do and not about what is wrong or right.

UseCase 1:

I have two datasets: a big one and a test set.
I want to use the test set (and nothing else) to test my final model / my final hyperparameters at the end.
I want to use the big set for training and hyperparameter search.
Because the big set is still small and because I want to avoid
overfitting, I want to use crossvalidation.

So I want to use FARM to split my big dataset into train and dev
and do cross validation for hyperparameter search.

Train is used for training, and dev is used for EarlyStopping and for evaluating the model every
evaluate_every steps.

Because of the way CV is currently implemented in FARM (described several times above), I
cannot split the big dataset into just train and dev. The current implementation
always splits the data I give to DataSiloForCrossVal into train, dev and test.
If you do not agree with this, please have a look at the code linked in my first post at the top.

I do not want to split it into 3 datasets but only into 2 (train and dev). With the current
behaviour, FARM "steals" data I do not want to give away.

UseCase 2:

In an early stage of a project, where I only have a small number of hand-labeled examples, I just want
to do crossvalidation on all the data I have. I want to skip a test set. This way I get an impression
of my model quality and can do active learning.

This use case is also not covered by the current implementation in FARM.

> To get to schema 1 with current FARM you can set test_filename=None.

That is not right. If I do that the data I give to DataSiloForCrossVal
is still split into train, dev and test (which is what I do not want).

@johann-petrak (Contributor)

@PhilipMay, if I understand you correctly, you want to use a single set as both the dev and the validation set for each fold? I assume you mean that for each split, one would create a training set and a validation set and then use the validation set both as the dev set and for estimation (which already happens for the early stopping decision, so no additional validation step would actually be needed)?

This is definitely not a proper way to do performance estimation, since what you get will be biased, and I think it should not be called a "crossvalidation estimate". I can see that it might be tempting to try this in practice when very short on data, but I am not sure it should be part of the library as such.

@PhilipMay (Contributor, Author) commented Sep 1, 2020

> @PhilipMay, if I understand you correctly, you want to use a single set as both the dev and the validation set for each fold? I assume you mean that for each split, one would create a training set and a validation set and then use the validation set both as the dev set and for estimation (which already happens for the early stopping decision, so no additional validation step would actually be needed)?
>
> This is definitely not a proper way to do performance estimation, since what you get will be biased, and I think it should not be called a "crossvalidation estimate". I can see that it might be tempting to try this in practice when very short on data, but I am not sure it should be part of the library as such.

Hmm - somehow we really seem to have a huge gap in our understanding of different things. For me, dev set and validation set are absolutely the same; they are synonyms to me.

And these two references (linked in the original comment) seem to have the same understanding.

@johann-petrak, so you are saying I need 4 sets, right? Train, test, dev and validation? I have never heard of that in the literature. Could you perhaps provide a link?

  • train to train
  • test to test at the end of cross validation
  • validation to test each single fold
  • dev to do early stopping

And you say that when you do early stopping and evaluation on the same set, you overfit to that set in each fold and so do not get representative results?

Is that what you mean?

@PhilipMay (Contributor, Author) commented Sep 1, 2020

@johann-petrak, in case you want to do early stopping and validation on different datasets, wouldn't it be better to do a nested cross validation? Especially if you have a small dataset?
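
For what it is worth, a minimal sketch of such a nested scheme with sklearn (not FARM code; train_with_early_stopping and evaluate are hypothetical helpers): the outer loop produces the estimate, while the inner split only feeds early stopping.

```python
# Nested crossvalidation sketch: early stopping and the per-fold evaluation
# never use the same data.
import numpy as np
from sklearn.model_selection import KFold, train_test_split

data = np.arange(100)   # placeholder for all labeled examples
outer_scores = []

for fit_idx, eval_idx in KFold(n_splits=5, shuffle=True, random_state=1).split(data):
    # inner split: dev part is used only for early stopping / model selection
    train_part, dev_part = train_test_split(data[fit_idx], test_size=0.1, random_state=1)
    model = train_with_early_stopping(train_part, dev_part)  # hypothetical helper
    outer_scores.append(evaluate(model, data[eval_idx]))     # hypothetical helper

estimate = float(np.mean(outer_scores))  # average over the outer folds
```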

@PhilipMay (Contributor, Author)

@johann-petrak and @Timoeller, since I agree that it is dangerous to do evaluation and early stopping on the same dataset, I added the possibility to do nested cross validation. Could you please have a look at PR #508?

Thanks
Philip

@johann-petrak (Contributor)

Unfortunately, there is no universal definition of these terms. Remember that many classical algorithms do not need a dev set at all, and that historically, especially in statistics, there was no practice of handing around datasets with a fixed "test" or even "dev" set at all.

It helps to point out which model we are talking about: the result of xval is NOT a model but an estimate, because in order to get that estimate, k models have been trained and evaluated. To train and evaluate each of these models, we use the same approach we would use if we just did this once: train on the training set and evaluate on the per-fold "test" set, which is usually called the evaluation or validation set in the context of xval.
In order to train, some algorithms, like neural networks, need a dev set. In principle, this should be something that is totally transparent in the training process: the algorithm trains on the training set, and if the algorithm needs a dev set, it is up to the algorithm how exactly to split that off the training data.
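
A tiny sketch of that "transparent dev split" idea (assumed names; not FARM code): the training routine takes only training data and splits its own dev set off internally.

```python
# The caller only hands over training data; the dev split for early stopping
# is an internal detail of the training routine.
import numpy as np
from sklearn.model_selection import train_test_split

def train(train_data, dev_fraction=0.1):
    """Hypothetical training routine that manages its own dev set."""
    train_part, dev_part = train_test_split(train_data, test_size=dev_fraction, random_state=0)
    # ... fit a model on train_part, early-stop on dev_part ...
    return {"n_train": len(train_part), "n_dev": len(dev_part)}  # placeholder for a model

model = train(np.arange(100))
```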

Now when we do architecture search, hyperparameter search and all that, things become more complicated and we may indeed need to do nested estimation. I just think that the scenarios for where and how you need to do nested evaluations are so diverse that it would be good to think a bit more about the exact way to do that, and especially about how to combine it with 1) other estimation strategies like hold-out estimates and 2) some way to also define the strategy of how to use those estimates, e.g. through greedy or beam-search hyperparameter optimization etc.

I am just a random outside contributor to FARM myself, so I will leave all that and also the decision of whether or how to use the suggested PR to the FARM developers.

@PhilipMay (Contributor, Author)

@johann-petrak, thanks again for your feedback. Now I understand why you do cross validation the way you do.
In this context - what do you think about PR #508? I think this would also enable a clean workflow - right?

Another thought.

FARM forces the user to load its data from disk. There is no way to just pass different datasets (numpy or pandas) to it. If there were, the user could decide which kind of cross validation to do: leave-one-out, nested, all the fancy stuff. If FARM opened up to just passing in different datasets, we would not be forced to do crossvalidation inside FARM, but could code it around / outside of FARM.
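
As an illustration of that idea, the splitting could then be done entirely outside FARM, for example like this (a sketch with assumed file names and a tab-separated layout; FARM would only ever see the per-fold files written here):

```python
# Build the folds yourself with pandas/sklearn and write per-fold train/dev
# files to disk; the pre-defined test file is never touched.
import pandas as pd
from sklearn.model_selection import KFold

df = pd.read_csv("big_dataset.tsv", sep="\t")  # assumed input file

kfold = KFold(n_splits=5, shuffle=True, random_state=7)
for i, (train_idx, dev_idx) in enumerate(kfold.split(df)):
    df.iloc[train_idx].to_csv(f"fold_{i}_train.tsv", sep="\t", index=False)
    df.iloc[dev_idx].to_csv(f"fold_{i}_dev.tsv", sep="\t", index=False)
```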

@Timoeller (Contributor)

I wish we had more "random outside contributors" like you, @johann-petrak,
and I am also glad that you joined the discussion with @PhilipMay on this topic.

We took the discussion offline (in a remote call, of course) and are aligned on the type of CV Johann implemented and its correctness. We will leave the code as is.

As mentioned by Philip, we need a way to handle the (sometimes) strong variance that comes from multiple model restarts. Philip is implementing a solution in #508.

Closing this PR now, though I hope new generations will be able to learn something from the discussion recorded here : )

@Timoeller closed this Sep 1, 2020
@PhilipMay deleted the new_xval branch September 6, 2020