Cross-Validation API for v1.0 #2487
As we approach v1.0, I thought it might be nice to look at the API for cross-validation. Currently, our cross-validation API takes the inputs:
```csharp
IDataView data;                      // Training data
IEstimator<ITransformer> estimator;  // Model to fit
int numFolds;                        // Number of folds to make
string labelColumn;                  // The label
string stratificationColumn;         // The column to stratify on
seed;                                // The seed
```
and returns an array of
```csharp
RegressionMetrics metrics;
ITransformer model;
IDataView scoredTestData;
```
with one entry for each fold.
I have a few questions:
More details for my added points:
We should emit an error/warning when the user tries to use CV with a stratification column of a type we don't handle well; our current handling differs per datatype.
We may also want to warn when the CV fold sizes are rather unbalanced, or, worse, when one fold has no data.
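A minimal sketch of such a check (the 5x imbalance threshold and all names here are assumptions, not the actual implementation):

```csharp
using System;
using System.Linq;

internal static class CrossValidationChecks
{
    // Warn on heavily unbalanced folds; fail outright on an empty one.
    public static void WarnOnUnbalancedFolds(long[] foldSizes)
    {
        long min = foldSizes.Min();
        long max = foldSizes.Max();
        if (min == 0)
            throw new InvalidOperationException("At least one cross-validation fold contains no data.");
        if (max > 5 * min)
            Console.WriteLine($"Warning: CV fold sizes are unbalanced (min={min}, max={max}).");
    }
}
```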
More specifically, let's consider the actual signature:
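Reconstructing it from the inputs and outputs listed above (the default values shown are my assumptions), it looks roughly like this:

```csharp
public (RegressionMetrics metrics, ITransformer model, IDataView scoredTestData)[] CrossValidate(
    IDataView data,
    IEstimator<ITransformer> estimator,
    int numFolds = 5,
    string labelColumn = "Label",
    string stratificationColumn = null,
    uint? seed = null);
```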
An easily fixed problem is that the method is not generic on the type of transformer returned. This has a practical effect: imagine you feed in SDCA or the averaged perceptron. Surprise! You won't be able to tell that the result is a linear model, because the method erases it to ITransformer. If it had a generic parameter, the per-fold models would keep their concrete type.
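For example, a generic variant might look like the following sketch (names and defaults are hypothetical; this is not the shipped API):

```csharp
// TModel flows through from the estimator, so the caller gets back, say, a
// linear model rather than a bare ITransformer.
public (RegressionMetrics metrics, TModel model, IDataView scoredTestData)[] CrossValidate<TModel>(
    IDataView data,
    IEstimator<TModel> estimator,
    int numFolds = 5,
    string labelColumn = "Label",
    string stratificationColumn = null,
    uint? seed = null)
    where TModel : class, ITransformer;
```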
The most serious problem is the return value.
Now, yeah, sure, we won't get around to aggregate metrics for v1, but if this returns an array, it becomes impossible to add them post-v1 without a breaking change. Rather, the return value should be some single object out of which you can get the per-fold results. We can always add more things (like aggregate metrics) to that object later; if it is a bare array, we simply cannot.
Further, those per-fold results should not be a value-tuple, for precisely the same reason: it is impossible to add more to them later without a breaking change.
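Something along these lines would stay extensible (a sketch only; all names are hypothetical):

```csharp
using System.Collections.Generic;

// Per-fold results as a class instead of a value-tuple: new members can be
// added later without breaking existing callers.
public sealed class CrossValidationResult<TMetrics>
{
    public TMetrics Metrics { get; }
    public ITransformer Model { get; }
    public IDataView ScoredHoldOutSet { get; }
    public int Fold { get; }
}

// A wrapper object rather than a bare array, so aggregate metrics and other
// summaries can be added post-v1.
public sealed class CrossValidationOutput<TMetrics>
{
    public IReadOnlyList<CrossValidationResult<TMetrics>> PerFoldResults { get; }
}
```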
(More generally: someone should follow up and make sure we don't use value-tuples in the public surface of our API. They are currently quite problematic for F#, and the tooling for them in VS is currently poor.)
Note that there are multiple of these methods, usually one per major training catalog. You mentioned the regression one.
Considering that we are investing heavily in documentation right now: we are currently putting this code in our samples. I don't mind going through the code and cleaning this stuff up, but first we need to agree on what we want to do.
Shall we have a dedicated result object along the lines sketched above, or do we prefer other options?
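For the samples, usage of such a result object might look like this (assuming the hypothetical CrossValidationResult shape above; the catalog method and property names are assumptions):

```csharp
var cvResults = mlContext.Regression.CrossValidate(data, pipeline, numFolds: 5);
foreach (var fold in cvResults.PerFoldResults)
    Console.WriteLine($"Fold {fold.Fold}: metrics = {fold.Metrics}");
```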
Stratification, as you point out, is a bad name. Indeed, what we do is, in a way, somewhat the opposite of stratification. The suggestion of "group" by itself is not the best either, since we already use that to identify actual groups.
So what we're trying to do with this column is identify sets of items that, if split across the training and test sets, would represent a form of label leakage. (Certainly, groups as we see them in ranker training represent one important case of this.) So maybe "co-dependency ID."
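To make the idea concrete, here is a minimal sketch (purely illustrative; in practice a deterministic hash would be needed, since String.GetHashCode is randomized per process on .NET Core):

```csharp
// Rows sharing the same co-dependency ID always land in the same fold, so
// related rows never straddle the train/test split.
static int AssignFold(string coDependencyId, int numFolds) =>
    (int)((uint)coDependencyId.GetHashCode() % (uint)numFolds);
```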