
Improve SamplingKeyColumn documentation and usability #5567

@justinormont


The use of SamplingKeyColumn is rather confusing. Perhaps we can improve it with documentation and runtime checks/warnings.

@tasmektep: In your sample, you're using the SamplingKeyColumn with your Label in it.

Using your Label as your SamplingKeyColumn will cause all rows with the same Label value to be placed together in the same splits/folds (as you're seeing).
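To make the grouping behavior concrete, here is a minimal, language-agnostic sketch in Python. It is not ML.NET's implementation; `grouped_split` and the hashing scheme are assumptions made purely for illustration. It only shows the guarantee described above: rows sharing a key value always end up in the same subset.

```python
import hashlib

def grouped_split(rows, key, test_fraction=0.5):
    """Hypothetical helper: assign rows to train/test using only the key value."""
    train, test = [], []
    for row in rows:
        # Hash the key value to a stable number between 0 and 1. Rows that
        # share a key get the same number, so they land in the same subset.
        digest = hashlib.sha256(str(row[key]).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test if bucket < test_fraction else train).append(row)
    return train, test

# A binary Label used as the sampling key: only two distinct key values exist.
rows = [{"Id": i, "Label": i % 2} for i in range(10)]
train, test = grouped_split(rows, key="Label")

train_labels = {r["Label"] for r in train}
test_labels = {r["Label"] for r in test}
print("labels in train:", train_labels, "labels in test:", test_labels)
```

With a binary Label as the key, each Label value lands entirely in one subset, so a subset can be missing a class or be empty altogether.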

Description from docs:

SamplingKeyColumn:
Name of a column to use for grouping rows. If two examples share the same value of the samplingKeyColumnName, they are guaranteed to appear in the same subset (train or test). This can be used to ensure no label leakage from the train to the test set. Note that when performing a Ranking Experiment, the samplingKeyColumnName must be the GroupId column. If null no row grouping will be performed.

https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.dataoperationscatalog.traintestsplit?view=ml-dotnet

You are likely thinking of the related, but inverse, concept of Stratification, where rows with each Label value are evenly represented across the splits/folds. Stratification has some downsides, causing it to be less helpful.
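For contrast, a stratified split could be sketched as follows. This is again plain Python under my own assumptions (`stratified_split` is a hypothetical helper, not an ML.NET API): rows are bucketed by Label and each bucket is divided between train and test, so every Label value shows up in both subsets.

```python
import random
from collections import defaultdict

def stratified_split(rows, label, test_fraction=0.2, seed=0):
    """Hypothetical helper: split each Label bucket proportionally."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label]].append(row)

    train, test = [], []
    for bucket in by_label.values():
        rng.shuffle(bucket)
        # Take roughly test_fraction of every bucket for the test set.
        n_test = round(len(bucket) * test_fraction)
        test.extend(bucket[:n_test])
        train.extend(bucket[n_test:])
    return train, test

strat_rows = [{"Id": i, "Label": i % 2} for i in range(20)]
strat_train, strat_test = stratified_split(strat_rows, label="Label")
print("labels in train:", {r["Label"] for r in strat_train})
print("labels in test:", {r["Label"] for r in strat_test})
```

Here the Label drives the split in the opposite way: it spreads each class across both subsets instead of keeping it in one.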

Originally posted by @justinormont in #5563 (comment)

@tasmektep: Keep posting issues that you run into. And thanks for posting your repro.

Work on ML.NET side:

  • Warning -- Have the splitter warn when zero rows are present in a split, with a special warning if SamplingKeyColumn is used. In the same fix, we could warn of unbalanced splits/folds to help with "Getting unbalanced (or empty) folds when running CV with a SamplingKeyColumn (such as when running CV with Ranking)" #3711. The downside is that the user would need to attach a logger to see the warning.
  • Documentation
    • Param hover -- Ensure the hover description for SamplingKeyColumn in Visual Studio is well worded to explain the concept, and perhaps mention what it does not do.
    • Samples/main docs -- Further explain the concept of SamplingKeyColumn, why it's useful, and also what it does not do.
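The warning item above could look roughly like the sketch below. It is written in Python only to show the logic; `check_split_sizes`, its `tolerance` parameter, and the message wording are all hypothetical and not part of ML.NET.

```python
import logging

logger = logging.getLogger("splitter_sketch")

def check_split_sizes(train_count, test_count, test_fraction,
                      sampling_key=None, tolerance=0.5):
    """Hypothetical check: log and return warnings for empty or unbalanced splits."""
    messages = []
    total = train_count + test_count
    # Extra hint only when a sampling key column was supplied.
    hint = ""
    if sampling_key is not None:
        hint = (f" Sampling key column '{sampling_key}' may have too few "
                "distinct values to split on.")

    if train_count == 0 or test_count == 0:
        messages.append(
            f"Empty split (train={train_count}, test={test_count}).{hint}")
    elif abs(test_count / total - test_fraction) > tolerance * test_fraction:
        messages.append(
            f"Unbalanced split: requested test fraction {test_fraction}, "
            f"got {test_count / total:.2f}.{hint}")

    for message in messages:
        logger.warning(message)
    return messages

empty = check_split_sizes(10, 0, test_fraction=0.2, sampling_key="Label")
balanced = check_split_sizes(80, 20, test_fraction=0.2)
skewed = check_split_sizes(50, 50, test_fraction=0.2)
```

Returning the messages as well as logging them is a choice made for the sketch so the result can be inspected without a logger attached.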

Originally posted by @justinormont in #5563 (comment)

In the param hover for SamplingKeyColumn in Visual Studio, mentioned above, we can also say to not put your Label column in SamplingKeyColumn.

It would be nice to automatically check this in TrainTestSplit, but the SamplingKeyColumn and Label aren't in the TrainTestSplit parameters together. The AutoML APIs can have this check, as SamplingKeyColumn and Label are both in the parameters (and may want to throw an ArgumentException instead of warning).
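A rough sketch of that check, in Python for illustration only: `validate_columns` is a hypothetical name, and `ValueError` stands in for the ArgumentException mentioned above.

```python
def validate_columns(label_column, sampling_key_column=None):
    """Hypothetical check: reject a sampling key that is the Label column."""
    if sampling_key_column is not None and sampling_key_column == label_column:
        raise ValueError(
            f"Sampling key column '{sampling_key_column}' is the same as the "
            "Label column; all rows with the same Label value would be placed "
            "in the same split/fold.")

validate_columns("Label", "GroupId")  # distinct columns: accepted
validate_columns("Label")             # no sampling key: accepted

try:
    validate_columns("Label", "Label")
    rejected = False
except ValueError:
    rejected = True
```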


Labels: documentation, up-for-grabs, usability
