Description
Afaik, there's no way in ML.NET to split an original dataset and create two train/test datasets that are both balanced based on the LABEL/TARGET-CLASS or any other column (STRATIFICATION COLUMN). Am I right?
I think this scenario is important because it makes it a lot easier to create balanced datasets that will provide more reliable metrics when evaluating/testing a model.
Currently in Scikit-Learn:
For instance, in Scikit-Learn you can do stratified sampling by splitting one dataset so that each split is similar with respect to some column. In a classification setting, stratification is often used to ensure that the train and test sets have approximately the same percentage of samples of each target class as the complete set.
This can be done in Scikit-Learn with the stratify argument of train_test_split(), where you can specify any column:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
stratify : array-like or None (default=None)
If not None, data is split in a stratified fashion, using this as the class labels.
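To make the behavior concrete, here is a small self-contained sketch of the Scikit-Learn call quoted above, using hypothetical toy data with an imbalanced 80/20 binary label:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 samples of class 0, 20 samples of class 1.
X = [[i] for i in range(100)]
y = [0] * 80 + [1] * 20

# stratify=y keeps the 80/20 class ratio in both the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

print(sum(y_train) / len(y_train))  # ~0.20 positive rate in train
print(sum(y_test) / len(y_test))    # ~0.20 positive rate in test
```

Without stratify, a random 25% test split could easily end up with a noticeably different positive rate than the full dataset, which is exactly what distorts evaluation metrics on small or imbalanced data.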
Currently in ML.NET:
In ML.NET's TrainTestSplit() API we have the samplingKeyColumnName, but that's kind of the opposite of a 'stratification column':
Name of a column to use for grouping rows. If two examples share the same value of the samplingKeyColumnName, they are guaranteed to appear in the same subset (train or test). This can be used to ensure no label leakage from the train to the test set. If null no row grouping will be performed.
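To contrast the two concepts, here is a minimal Python sketch (not ML.NET code) of the group-preserving semantics described in that quote: every row sharing the same key value lands in the same subset, decided deterministically by hashing the key:

```python
import hashlib

def group_preserving_split(rows, key, test_fraction):
    """Send all rows that share a key value to the same subset,
    mimicking the samplingKeyColumnName guarantee (illustrative sketch)."""
    train, test = [], []
    for row in rows:
        # Hash the key value so the subset assignment depends only on the key.
        h = int(hashlib.md5(str(row[key]).encode()).hexdigest(), 16)
        bucket = (h % 100) / 100.0
        (test if bucket < test_fraction else train).append(row)
    return train, test

# Hypothetical rows keyed by user: each user's rows stay together,
# preventing leakage of a user across train and test.
rows = [{"user": u, "i": i} for u in ("a", "b", "c", "d") for i in range(3)]
train, test = group_preserving_split(rows, "user", test_fraction=0.5)
```

Note that this guarantee says nothing about keeping label proportions balanced across the subsets, which is why it is the opposite of stratification rather than a substitute for it.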
It would be a good improvement for ML.NET to support a stratify feature in the TrainTestSplit() API.
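The requested feature itself is simple to describe. Below is a minimal pure-Python sketch of what a stratified split does under the hood (group rows by the stratification column, then split each group proportionally); the function name and row layout are illustrative assumptions, not a proposed ML.NET API:

```python
import random
from collections import defaultdict

def stratified_split(rows, key, test_fraction, seed=0):
    """Split rows so each value of the stratification column keeps
    roughly the same proportion in train and test (illustrative sketch)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    train, test = [], []
    for group in groups.values():
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Imbalanced toy data: 90 negatives, 10 positives.
rows = [{"label": 0}] * 90 + [{"label": 1}] * 10
train, test = stratified_split(rows, "label", test_fraction=0.2)
# Both subsets keep the ~10% positive rate of the full dataset.
```

An ML.NET version would presumably take a column name, like samplingKeyColumnName does today, but apply this per-group proportional sampling instead of the group-preserving rule.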
RELATED ISSUES:
#2536 (In reality we didn't have a stratification column there; that was a wrong name for what is now samplingKeyColumnName)
#1204 (Here Pete mistakenly called the sampling key column, or 'GroupPreservationColumn', a stratification column)