
Ensure no classes are missing from training/validation splits during data splitting #760

Closed
dsherry opened this issue May 8, 2020 · 5 comments · Fixed by #1226
Labels: enhancement (An improvement to an existing feature.)

dsherry commented May 8, 2020

Problem
It's possible for random splitting to produce a split in which a class is missing from the training and/or validation data, during either CV splitting or train-validation (TVH) splitting. This is more likely to happen to a minority class when there's a class imbalance.

Proposed Solutions
Detection: A basic safeguard is to check after data splitting whether any classes are missing on either side, and if so log a warning or raise an error. This would be straightforward to implement.

Retry: If we detect this situation, we could also rerun the random data splitting, up to a maximum number of attempts (perhaps 10), as sketched below.
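
A minimal sketch of the detection-plus-retry idea, assuming numpy arrays and sklearn's `train_test_split`; the function name and `max_retries` parameter are illustrative, not existing evalml API:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_with_retry(X, y, test_size=0.2, max_retries=10, random_state=0):
    """Randomly split X/y, retrying if any class is missing from either side."""
    classes = set(np.unique(y))
    rng = np.random.RandomState(random_state)
    for attempt in range(max_retries):
        # Draw a fresh random split on each attempt.
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=test_size, random_state=rng.randint(2**31 - 1))
        if set(np.unique(y_train)) == classes and set(np.unique(y_val)) == classes:
            return X_train, X_val, y_train, y_val
        print(f"Attempt {attempt + 1}: a class was missing from a split; retrying")
    raise RuntimeError(f"No valid split found in {max_retries} attempts")
```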

Class-balanced sampling: We could also write our own data splitting implementation which ensures the label distribution is as similar as possible across splits. We could do this by splitting each unique label's rows into train and validation subsets, then combining those subsets across all labels (see the sketch below).
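
A minimal sketch of that per-class splitting idea, assuming `y` is a numpy array of labels; the helper name and parameters are hypothetical:

```python
import numpy as np

def class_balanced_split(y, test_fraction=0.2, random_state=0):
    """Return train/validation row indices, splitting each class separately."""
    rng = np.random.RandomState(random_state)
    train_idx, val_idx = [], []
    for label in np.unique(y):
        rows = np.where(y == label)[0]
        rng.shuffle(rows)
        # Keep at least one row per side; note a singleton class still
        # can't appear in both splits, which the detection check would flag.
        n_val = max(1, int(round(len(rows) * test_fraction)))
        val_idx.extend(rows[:n_val])
        train_idx.extend(rows[n_val:])
    return np.array(train_idx), np.array(val_idx)
```

This is effectively hand-rolled stratification, which is close to what sklearn's stratified splitters already do.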

dsherry added the enhancement label on May 8, 2020
dsherry commented May 8, 2020

Related to: #194 #457

dsherry commented May 8, 2020

We could split the warning into its own issue and prioritize that, then circle back later to add retrying or class-balanced sampling.

kmax12 commented May 11, 2020

dsherry commented May 11, 2020

@kmax12 it appears we already use StratifiedKFold by default! So that's helpful.

I think this issue should track:

  • Add solid test coverage of class imbalance in the binary and multiclass cases.
  • Add a class-imbalance data check which warns if the class imbalance is severe enough to affect data splitting.
  • Add an internal check which raises an error if we generate a training split that is missing a class.
  • For CV splitting, make sure we understand and can describe how sklearn's StratifiedKFold handles these edge cases.
  • Figure out what we'd use to get a stratified TVH split (likely StratifiedShuffleSplit; see the sketch below).

It's possible we could catch these edge cases better with a custom implementation than with sklearn's. But that may not be needed for this issue. I'd want us to first use the sklearn classes as discussed, and if we run into trouble we can reassess.
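
As a rough sketch of the last two checklist items, assuming a toy imbalanced numpy dataset, here's how StratifiedKFold (for CV) and StratifiedShuffleSplit (for a TVH-style split) could be paired with the internal missing-class check; `check_split` is a hypothetical helper, not existing evalml code:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # imbalanced binary labels
classes = set(np.unique(y))

def check_split(name, idx):
    # Internal check: raise an error if a split is missing any class.
    missing = classes - set(y[idx])
    if missing:
        raise ValueError(f"Classes {missing} are missing from the {name} split")

# CV splitting: StratifiedKFold preserves per-fold class proportions.
for train_idx, val_idx in StratifiedKFold(n_splits=2).split(X, y):
    check_split("train", train_idx)
    check_split("validation", val_idx)

# TVH-style splitting: StratifiedShuffleSplit yields one stratified split.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, val_idx = next(sss.split(X, y))
check_split("train", train_idx)
check_split("validation", val_idx)
```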

dsherry added this to the September 2020 milestone on Aug 7, 2020
dsherry commented Sep 11, 2020

@jeremyliweishih FYI I think we forgot to include this in the data splitting epic
