Description
Afaik, there's no way in ML.NET to split an original dataset and create two train/test datasets that are both balanced based on the LABEL/TARGET-CLASS or any other column (STRATIFICATION COLUMN). Am I right?
I think this scenario is important because it makes it a lot easier to create balanced datasets that will provide more reliable metrics when evaluating/testing a model.
Currently in Scikit-Learn:
For instance, in Scikit-Learn you can do stratified sampling by splitting one dataset so that each split is similar with respect to some column. In a classification setting, stratification is often used to ensure that the train and test sets have approximately the same percentage of samples of each target class as the complete set.
This can be done in Scikit-Learn with the stratify argument of train_test_split(), where you can specify any column:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
stratify : array-like or None (default=None)
If not None, data is split in a stratified fashion, using this as the class labels.
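To make the behavior concrete, here is a small self-contained sketch of the Scikit-Learn call quoted above, using hypothetical toy data with an imbalanced 80/20 binary label:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 samples of class 0, 20 samples of class 1.
X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20

# stratify=y keeps the 80/20 class ratio in both the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(sum(y_train) / len(y_train))  # ~0.20 positive rate in train
print(sum(y_test) / len(y_test))    # ~0.20 positive rate in test
```

Without stratify, a random 25% test split could easily end up with a noticeably different positive rate than the full dataset, which is exactly what distorts evaluation metrics on small or imbalanced data.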
Currently in ML.NET:
In ML.NET's TrainTestSplit() API we have the samplingKeyColumnName, but that's kind of the opposite of a 'stratification column':
Name of a column to use for grouping rows. If two examples share the same value of the samplingKeyColumnName, they are guaranteed to appear in the same subset (train or test). This can be used to ensure no label leakage from the train to the test set. If null no row grouping will be performed.
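To contrast the two concepts, here is a minimal Python sketch (not ML.NET code) of the group-preserving semantics described in that quote: every row sharing the same key value lands in the same subset, decided deterministically by hashing the key:

```python
import hashlib

def group_preserving_split(rows, key, test_fraction):
    """Send all rows that share a key value to the same subset,
    mimicking the samplingKeyColumnName guarantee (illustrative sketch)."""
    train, test = [], []
    for row in rows:
        # Hash the key value so the subset assignment depends only on the key.
        h = int(hashlib.md5(str(row[key]).encode()).hexdigest(), 16)
        bucket = (h % 100) / 100.0
        (test if bucket < test_fraction else train).append(row)
    return train, test

# Hypothetical rows keyed by user: each user's rows stay together,
# preventing leakage of a user across train and test.
rows = [{"user": u, "i": i} for u in ("a", "b", "c", "d") for i in range(3)]
train, test = group_preserving_split(rows, "user", test_fraction=0.5)
```

Note that this guarantee says nothing about keeping label proportions balanced across the subsets, which is why it is the opposite of stratification rather than a substitute for it.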
It would be a good improvement for ML.NET to support a stratify feature in the TrainTestSplit() API.
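The requested feature itself is simple to describe. Below is a minimal pure-Python sketch of what a stratified split does under the hood (group rows by the stratification column, then split each group proportionally); the function name and row layout are illustrative assumptions, not a proposed ML.NET API:

```python
import random
from collections import defaultdict

def stratified_split(rows, key, test_fraction, seed=0):
    """Split rows so each value of the stratification column keeps
    roughly the same proportion in train and test (illustrative sketch)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    train, test = [], []
    for group in groups.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Imbalanced toy data: 90 negatives, 10 positives.
rows = [{"label": 0}] * 90 + [{"label": 1}] * 10
train, test = stratified_split(rows, "label", test_fraction=0.2)
# Both subsets keep the ~10% positive rate of the full dataset.
```

An ML.NET version would presumably take a column name, like samplingKeyColumnName does today, but apply this per-group proportional sampling instead of the group-preserving rule.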
RELATED ISSUES:
#2536 (In reality we didn't have a stratification column there; that was a wrong name for what is now samplingKeyColumnName)
#1204 (Here Pete mistakenly called the sampling key column, or 'GroupPreservationColumn', a stratification column)