Ensure no classes are missing from training/validation splits during data splitting #760
Comments
We could split the warning into its own issue and prioritize that, then circle back later to add retrying or class-balanced sampling.
Re: class-balanced sampling: can we use sklearn's stratify options? For example, https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html is a class, but other methods have stratify as an option: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html#sklearn.model_selection.train_test_split
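For reference, a minimal sketch of both options (the toy `X` and `y` here are hypothetical, just to show the API):

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical toy data with a 90/10 class imbalance.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

# Option 1: stratified cross-validation -- every fold preserves the label ratio.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X, y):
    pass  # fit/score on each fold here

# Option 2: a single stratified train/validation split via the stratify option.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```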
@kmax12 it appears we already use that. I think this issue should track:
It is possible we can do a better job of catching these edge cases with a custom implementation than with sklearn's, but that may not be needed for this issue. I'd want us to first use the sklearn classes as discussed, and if we run into trouble we can reassess.
@jeremyliweishih FYI I think we forgot to include this in the data splitting epic
Problem
It's possible that random splitting could produce a dataset where a class is missing from the training and/or validation data, either during CV splitting or train-validation (TVH) splitting. This is more likely to occur for a minority class when there's a class imbalance.
Proposed Solutions
Detection: A basic protection is to check after data splitting whether any classes are missing on either side, and if so log a warning or error. This would be straightforward to implement.
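A rough sketch of that check (the function name and logger setup are hypothetical, not existing code):

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)


def warn_on_missing_classes(y, train_idx, val_idx):
    """Log a warning if any class in y is absent from either split; return True if all present."""
    y = np.asarray(y)
    all_classes = set(np.unique(y))
    missing_train = all_classes - set(np.unique(y[train_idx]))
    missing_val = all_classes - set(np.unique(y[val_idx]))
    if missing_train:
        logger.warning("Classes missing from training split: %s", missing_train)
    if missing_val:
        logger.warning("Classes missing from validation split: %s", missing_val)
    return not (missing_train or missing_val)
```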
Retry: If we detect this situation, we could also try the random data splitting again, up to a max number of times (perhaps 10).
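One possible shape for the retry loop, sketched with sklearn's ShuffleSplit (the helper and parameter names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit


def _all_classes_present(y, train_idx, val_idx):
    """Check that every class in y appears in both index sets."""
    y = np.asarray(y)
    classes = set(np.unique(y))
    return (set(np.unique(y[train_idx])) == classes
            and set(np.unique(y[val_idx])) == classes)


def split_with_retries(X, y, max_retries=10, test_size=0.2, seed=0):
    """Redraw a random split until every class appears on both sides,
    returning the last attempt if max_retries draws all fail."""
    for attempt in range(max_retries):
        splitter = ShuffleSplit(n_splits=1, test_size=test_size,
                                random_state=seed + attempt)
        train_idx, val_idx = next(splitter.split(X, y))
        if _all_classes_present(y, train_idx, val_idx):
            break
    return train_idx, val_idx
```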
Class-balanced sampling (i.i.d. distribution): We could also write our own data splitting implementation which ensures that the distribution of labels is as similar as possible across splits. We could do this by splitting each unique label's rows separately, then combining the per-label pieces into the final training and validation sets.
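A minimal sketch of that per-label strategy, assuming a two-way train/validation split (the function name and edge-case handling are hypothetical, not a decided design):

```python
import numpy as np


def class_balanced_split(y, test_size=0.2, seed=0):
    """Split each unique label's rows separately, then pool the per-label
    pieces so the label distribution is preserved on both sides."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, val_idx = [], []
    for label in np.unique(y):
        rows = np.flatnonzero(y == label)
        rng.shuffle(rows)
        if len(rows) > 1:
            # Clamp so every label with 2+ rows lands on both sides.
            n_val = min(len(rows) - 1, max(1, int(round(test_size * len(rows)))))
        else:
            # A single-row label can only go to one side; send it to training.
            n_val = 0
        val_idx.extend(rows[:n_val])
        train_idx.extend(rows[n_val:])
    return np.array(train_idx), np.array(val_idx)
```

Note the trade-off this makes explicit: unlike a purely random split, a label with only one row can never appear on both sides, so detection/warning is still needed for that case.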