Support stratify in TrainTestSplit() API #4082

@CESARDELATORRE

AFAIK, there's no way in ML.NET to split an original dataset into train and test datasets that are both balanced on the label/target class or on any other column (a stratification column). Am I right?

I think this scenario is important because it makes it a lot easier to create balanced datasets, which provide more reliable metrics when evaluating/testing a model.

Currently in scikit-learn:

For instance, in scikit-learn you can do stratified sampling, splitting a dataset so that the splits are similar with respect to some column. In a classification setting, this is often used to ensure that the train and test sets have approximately the same percentage of samples of each target class as the complete set.
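To make the idea concrete, here is a minimal sketch of stratified splitting using only the Python standard library. It is not how scikit-learn or ML.NET implement it; the function name `stratified_split` and the sample data are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, labels, test_fraction=0.25, seed=0):
    """Split rows so each label keeps roughly the same share in both subsets."""
    rng = random.Random(seed)
    # Group row indices by label (the stratification column).
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Send the same fraction of every class to the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    train = [(rows[i], labels[i]) for i in sorted(train_idx)]
    test = [(rows[i], labels[i]) for i in sorted(test_idx)]
    return train, test


# Hypothetical imbalanced data: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
rows = list(range(100))
train, test = stratified_split(rows, labels, test_fraction=0.2)
train_counts = Counter(label for _, label in train)
test_counts = Counter(label for _, label in test)
```

Because the sampling happens per class, both subsets keep the 9:1 class ratio of the full dataset, which a plain random split does not guarantee for a rare class.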

This can be done in scikit-learn with the stratify argument of train_test_split(), where you can specify any column:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

stratify : array-like or None (default=None)
If not None, data is split in a stratified fashion, using this as the class labels.
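As a rough illustration of that argument, the snippet below passes the labels as `stratify` to `train_test_split()`. The data is made up for the example, and it assumes scikit-learn is installed.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 negatives, 10 positives.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

# stratify=y asks for the class proportions of y to be preserved in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
train_counts = Counter(y_train)
test_counts = Counter(y_test)
```

Any array-like with one entry per row can be passed as `stratify`, so the split can be balanced on a column other than the label.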

Currently in ML.NET:

In ML.NET, the TrainTestSplit() API has the samplingKeyColumnName parameter, but that is more or less the opposite of a stratification column:

Name of a column to use for grouping rows. If two examples share the same value of the samplingKeyColumnName, they are guaranteed to appear in the same subset (train or test). This can be used to ensure no label leakage from the train to the test set. If null no row grouping will be performed.
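For contrast with stratification, the following sketch shows the grouping behavior that description implies: rows sharing a key always land in the same subset. It is written in plain Python and is only an assumption about the general technique, not ML.NET's actual implementation; `group_preserving_split` and `key_of` are hypothetical names.

```python
import hashlib


def group_preserving_split(rows, key_of, test_fraction=0.25):
    """Assign rows to train or test by key, so a key never spans both subsets."""
    train, test = [], []
    for row in rows:
        # Hash the key to a stable number in [0, 1); equal keys get equal buckets.
        digest = hashlib.sha256(str(key_of(row)).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test if bucket < test_fraction else train).append(row)
    return train, test


# Hypothetical rows of (user_id, label); each user_id appears three times.
rows = [(user, label) for user in range(20) for label in (0, 1, 0)]
train, test = group_preserving_split(rows, key_of=lambda r: r[0])
train_keys = {r[0] for r in train}
test_keys = {r[0] for r in test}
```

Nothing in this scheme looks at label proportions: it keeps groups together to avoid leakage, whereas stratification deliberately spreads each class across both subsets.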

It would be a good improvement for ML.NET to support a stratify feature in the TrainTestSplit() API.

RELATED ISSUES:

#2536 (In reality we didn't have a stratification column; that was the wrong name. It was what is now samplingKeyColumnName.)

#1204 (Here Pete was wrong to call a sampling key column, or 'GroupPreservationColumn', a stratification column.)

Labels: P2 (priority of the issue for triage purposes: needs to be fixed at some point), enhancement (new feature or request)