Using your Label as your SamplingKeyColumn will cause all rows with the same Label value to be placed together in the same splits/folds (as you're seeing).
Description from docs:
SamplingKeyColumn:
Name of a column to use for grouping rows. If two examples share the same value of the samplingKeyColumnName, they are guaranteed to appear in the same subset (train or test). This can be used to ensure no label leakage from the train to the test set. Note that when performing a Ranking Experiment, the samplingKeyColumnName must be the GroupId column. If null no row grouping will be performed.
You are likely thinking of the related, but inverse, concept of Stratification, where rows with the same value are spread evenly across the splits/folds. Stratification has some downsides, causing it to be less helpful.
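To make the contrast concrete, here is a minimal conceptual sketch in plain Python. It is not ML.NET's implementation: the helper names (`grouped_split`, `stratified_split`), the hash-based bucketing, and the toy `UserId`/`Label` rows are all assumptions made for illustration. It shows that grouping by a key keeps every row with that key in one subset, so grouping by Label leaves each label value entirely in train or entirely in test, while a stratified split spreads each label value across both.

```python
import hashlib
from collections import defaultdict


def _bucket(key, test_fraction):
    """Deterministically map a key to 'test' or 'train' using a hash (illustrative only)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # pseudo-uniform value in [0, 1)
    return "test" if u < test_fraction else "train"


def grouped_split(rows, key_column, test_fraction=0.2):
    """Grouping: rows that share a key value always land in the same subset."""
    train, test = [], []
    for row in rows:
        target = test if _bucket(row[key_column], test_fraction) == "test" else train
        target.append(row)
    return train, test


def stratified_split(rows, label_column, test_fraction=0.2):
    """Stratification: each label value is split in roughly the same proportion."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_column]].append(row)
    train, test = [], []
    for label_rows in by_label.values():
        n_test = round(len(label_rows) * test_fraction)
        test.extend(label_rows[:n_test])
        train.extend(label_rows[n_test:])
    return train, test


# Hypothetical toy data: 10 users, binary label.
rows = [{"UserId": i % 10, "Label": i % 2} for i in range(100)]

# Grouping by UserId: no user appears in both subsets.
train, test = grouped_split(rows, "UserId")
assert not ({r["UserId"] for r in train} & {r["UserId"] for r in test})

# Grouping by Label (the mistake discussed here): no label value is shared
# between train and test, so a model cannot be evaluated on a class it saw.
train, test = grouped_split(rows, "Label")
assert not ({r["Label"] for r in train} & {r["Label"] for r in test})

# Stratifying by Label: both label values appear in both subsets.
train, test = stratified_split(rows, "Label")
assert {r["Label"] for r in train} == {r["Label"] for r in test} == {0, 1}
```

With only two distinct label values, grouping by Label can even leave one subset empty, depending on where the two keys happen to fall.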
Param hover -- Ensure the hover description for SamplingKeyColumn in Visual Studio is well worded to explain the concept, and perhaps mention what it does not do.
Samples/main docs -- Further explain the concept of SamplingKeyColumn, why it's useful, and also what it does not do.
In the param hover for SamplingKeyColumn in Visual Studio, mentioned above, we can also say to not put your Label column in SamplingKeyColumn.
It would be nice to check this automatically in TrainTestSplit, but SamplingKeyColumn and Label aren't both available in the TrainTestSplit parameters. The AutoML APIs can perform this check, since SamplingKeyColumn and Label are both in their parameters (and may want to throw an ArgumentException instead of warning).
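The kind of runtime check proposed above could look roughly like the following sketch. It is written in plain Python for illustration only; `check_sampling_key` and its `strict` flag are hypothetical names, and `ValueError` stands in for the ArgumentException mentioned above. It does not reflect any existing ML.NET or AutoML code.

```python
import warnings


def check_sampling_key(label_column, sampling_key_column, strict=True):
    """Reject (or warn about) using the label column as the sampling key.

    Hypothetical helper: with strict=True it raises, otherwise it only warns.
    """
    if sampling_key_column is None:
        return  # No row grouping requested, nothing to validate.
    if sampling_key_column == label_column:
        message = (
            f"Sampling key column '{sampling_key_column}' is the same as the label "
            "column; all rows sharing a label value would land in the same split/fold."
        )
        if strict:
            raise ValueError(message)
        warnings.warn(message)


# Usage: a distinct grouping column passes, the label column is rejected.
check_sampling_key("Label", "UserId")
try:
    check_sampling_key("Label", "Label")
except ValueError as err:
    print(f"rejected: {err}")
```

Whether to raise or warn is a design choice: raising stops a split that is almost certainly unintended, while a warning keeps existing code running.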
The use of SamplingKeyColumn is rather confusing. Perhaps we can improve it with documentation and runtime checks/warnings.
Originally posted by @justinormont in #5563 (comment)