
Ensure no classes are missing from training/validation splits during data splitting #760

Closed
dsherry opened this issue May 8, 2020 · 5 comments · Fixed by #1226
Labels: enhancement (An improvement to an existing feature.)

dsherry commented May 8, 2020

Problem
It's possible for random splitting to produce a split in which a class is missing from the training and/or validation data, during either CV splitting or train-validation (TVH) splitting. This is more likely to happen to a minority class when there's a class imbalance.

Proposed Solutions
Detection: A basic safeguard is to check after data splitting whether any classes are missing on either side, and if so log a warning or raise an error. This would be straightforward to implement.

Retry: If we detect this situation, we could also rerun the random data splitting, up to a maximum number of attempts (perhaps 10), as sketched below.
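
A minimal sketch of the detection-plus-retry idea, assuming numpy arrays and sklearn's `train_test_split`; the function name and `max_retries` parameter are illustrative, not existing evalml API:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_with_retry(X, y, test_size=0.2, max_retries=10, random_state=0):
    """Randomly split X/y, retrying if any class is missing from either side."""
    classes = set(np.unique(y))
    rng = np.random.RandomState(random_state)
    for attempt in range(max_retries):
        # Draw a fresh random split on each attempt.
        X_train, X_val, y_train, y_val = train_test_split(
            X, y, test_size=test_size, random_state=rng.randint(2**31 - 1))
        if set(np.unique(y_train)) == classes and set(np.unique(y_val)) == classes:
            return X_train, X_val, y_train, y_val
        print(f"Attempt {attempt + 1}: a class was missing from a split; retrying")
    raise RuntimeError(f"No valid split found in {max_retries} attempts")
```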

Class-balanced sampling: We could also write our own data splitting implementation which ensures the label distribution is as similar as possible across splits. We could do this by splitting each unique label's rows into train and validation subsets, then combining those subsets across all labels (see the sketch below).
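
A minimal sketch of that per-class splitting idea, assuming `y` is a numpy array of labels; the helper name and parameters are hypothetical:

```python
import numpy as np

def class_balanced_split(y, test_fraction=0.2, random_state=0):
    """Return train/validation row indices, splitting each class separately."""
    rng = np.random.RandomState(random_state)
    train_idx, val_idx = [], []
    for label in np.unique(y):
        rows = np.where(y == label)[0]
        rng.shuffle(rows)
        # Keep at least one row per side; note a singleton class still
        # can't appear in both splits, which the detection check would flag.
        n_val = max(1, int(round(len(rows) * test_fraction)))
        val_idx.extend(rows[:n_val])
        train_idx.extend(rows[n_val:])
    return np.array(train_idx), np.array(val_idx)
```

This is effectively hand-rolled stratification, which is close to what sklearn's stratified splitters already do.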

dsherry added the enhancement label on May 8, 2020
dsherry commented May 8, 2020

Related to: #194 #457

dsherry commented May 8, 2020

We could split the warning into its own issue and prioritize that, then circle back later to add retrying or class-balanced sampling.

kmax12 commented May 11, 2020

dsherry commented May 11, 2020

@kmax12 it appears we already use StratifiedKFold by default! So that's helpful.

I think this issue should track:

  • Add solid test coverage of class imbalance in the binary and multiclass cases.
  • Add a class-imbalance data check which warns if the class imbalance is severe enough to affect data splitting.
  • Add an internal check which raises an error if we generate a training split that is missing a class.
  • For CV splitting, make sure we understand and can describe how sklearn's StratifiedKFold handles these edge cases.
  • Figure out what we'd use to get a stratified TVH split (likely StratifiedShuffleSplit; see the sketch below).

It's possible we could catch these edge cases better with a custom implementation than with sklearn's. But that may not be needed for this issue. I'd want us to first use the sklearn classes as discussed, and if we run into trouble we can reassess.
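
As a rough sketch of the last two checklist items, assuming a toy imbalanced numpy dataset, here's how StratifiedKFold (for CV) and StratifiedShuffleSplit (for a TVH-style split) could be paired with the internal missing-class check; `check_split` is a hypothetical helper, not existing evalml code:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # imbalanced binary labels
classes = set(np.unique(y))

def check_split(name, idx):
    # Internal check: raise an error if a split is missing any class.
    missing = classes - set(y[idx])
    if missing:
        raise ValueError(f"Classes {missing} are missing from the {name} split")

# CV splitting: StratifiedKFold preserves per-fold class proportions.
for train_idx, val_idx in StratifiedKFold(n_splits=2).split(X, y):
    check_split("train", train_idx)
    check_split("validation", val_idx)

# TVH-style splitting: StratifiedShuffleSplit yields one stratified split.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
train_idx, val_idx = next(sss.split(X, y))
check_split("train", train_idx)
check_split("validation", val_idx)
```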

dsherry added this to the September 2020 milestone on Aug 7, 2020
dsherry commented Sep 11, 2020

@jeremyliweishih FYI I think we forgot to include this in the data splitting epic
