Cross-Validation API for v1.0 #2487

Open
rogancarr opened this Issue Feb 10, 2019 · 9 comments

@rogancarr
Contributor

rogancarr commented Feb 10, 2019

Hi All,

As we approach v1.0, I thought it might be nice to look at the API for cross-validation. Currently, our cross-validation API takes the inputs:

IDataView data; // Training data
IEstimator<ITransformer> estimator; // Model to fit
int numFolds; // Number of folds to make
string labelColumn; // The label
string stratificationColumn; // The column to stratify on
uint? seed; // The random seed

and returns an array of

RegressionMetrics metrics;
ITransformer model;
IDataView scoredTestData;

with one entry for each fold.
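For concreteness, here is a rough sketch of how a caller exercises this today (illustrative only; `mlContext`, `data`, and `estimator` are assumed to be set up elsewhere, and this shows the regression flavor of the method):

// Sketch only: `data` is an IDataView and `estimator` an IEstimator<ITransformer>.
var cvResults = mlContext.Regression.CrossValidate(
    data, estimator, numFolds: 5,
    labelColumn: "Label", stratificationColumn: null, seed: null);

// One (metrics, model, scoredTestData) entry per fold.
foreach (var fold in cvResults)
    Console.WriteLine($"Fold R^2: {fold.metrics.RSquared}");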

I have a few questions:

  1. Are we happy with the outputs?
    I'm not overly concerned with these, but it will be hard to make this list smaller as we go.
  2. Do we need to specify labelColumn?
    Isn't there a way to get the label from the model? Making this explicit means that we are allowing the learner and the CV metrics to utilize different labels.
  3. Are we using the right terminology for stratification?
    Stratification usually means that ratios of classes are maintained across splits (see stratified sampling on Wikipedia). Here, stratification means that items with the same value are clumped into the same split. The former makes sense if you want to maintain class ratios, especially with highly imbalanced classes, while the latter is useful for things like ranking (e.g. groupIds) or where leakage due to something like ordering may be a concern.

@rogancarr rogancarr added the question label Feb 10, 2019

@rogancarr rogancarr added this to To do in Project 13 via automation Feb 10, 2019

@rogancarr

Contributor Author

rogancarr commented Feb 10, 2019

For item no. 3, I suggest the following:

  • Instead of Stratification, call it something like Grouping or RetainGroups, so CV would have the option GroupingColumn or RetainGroupsColumn (a call-site sketch follows below).
  • Add support for ratio-balancing "stratification" in v1.1.
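A rough sketch of what the renamed option might look like at the call site (illustrative only, not an existing API):

// Illustrative only: the proposed rename, shown against the regression CV method.
var cvResults = mlContext.Regression.CrossValidate(
    data, estimator, numFolds: 5,
    labelColumn: "Label",
    groupingColumn: "GroupId"); // was: stratificationColumn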
@rogancarr

Contributor Author

rogancarr commented Feb 10, 2019

@shauheen @glebuk @TomFinley What do you think?

@justinormont

Member

justinormont commented Feb 10, 2019

For #1:

  • I'd like to see aggregate metrics returned.
  • Is scoredTestData lazy? Or cached from the metrics calculation? If cached, this reduces our scalability.

Another question:

  4. Can we validate values in the `Stratification` column?
    It's too easy for users to shoot themselves in the foot. -- also in v1.1
@justinormont

Member

justinormont commented Feb 10, 2019

More details for my added #4...

We should emit an error/warning when the user tries to use CV w/ a Stratification column which includes numbers not between 0 & 1. I ran into the issue w/ a ranking dataset.

I think our current handling per datatype is:

  • text : Run thru hashing, then evenly divide the hash space [0...2^hashbits] into the folds of [0...2^hashbits/k), ...
  • boolean : unknown to me
  • I4 : unknown to me
  • key : Evenly divide the key space [0..N] into the folds [0, N/k), [N/k, 2*N/k), ...
    • The downside here is if this was generated from a term transform on a column. **the semi-bad part** The larger/more-common groups are found sooner and get placed as lower numbers in the key space, whereas uncommon groups are found last and are assigned the last key values. The user gets no warning of the resulting imbalanced folds.
  • R4 : Blindly assume the space is uniformly distributed over [0..1] **the bad part**
    • The downside here is if the value is > 1: the user gets no warning that all rows will go into one CV fold (see the sketch after this list).

We may also want to warn when the CV fold sizes are rather unbalanced, or worse when one fold has no data.
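To make the R4 failure mode concrete, here is a small sketch (not the actual ML.NET implementation) of what fold assignment looks like when we blindly assume the value is uniform over [0, 1):

// Illustration only, not the real implementation.
static int AssignFold(float stratValue, int numFolds)
{
    // Assumes stratValue lies in [0, 1).
    int fold = (int)(stratValue * numFolds);
    return Math.Min(Math.Max(fold, 0), numFolds - 1); // clamp to a valid fold
}

// With ranking-style group ids 1, 2, 3, ... every row clamps to the last fold:
// AssignFold(1f, 5) == 4, AssignFold(73f, 5) == 4, ...
// i.e. all rows silently end up in one CV fold.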

@TomFinley

Contributor

TomFinley commented Feb 11, 2019

More specifically, perhaps, let's consider the actual signature:

public (RegressionMetrics metrics, ITransformer model, IDataView scoredTestData)[] CrossValidate(
    IDataView data, IEstimator<ITransformer> estimator, int numFolds = 5,
    string labelColumn = DefaultColumnNames.Label,
    string stratificationColumn = null, uint? seed = null)

An easily fixed problem is that the method is not generic on the type of transformer returned. This has a practical effect... let's imagine you feed in SDCA or AP... surprise! You won't be able to know that the result is a linear model, because this method was not generic on the transformer type. If it had a generic parameter TTransformer with the restriction where TTransformer : ITransformer, then it would be right there in the type, you'd get sensible IntelliSense, etc. (Practically, this type parameter would be inferred automatically, I believe.) (Note that this also means that the return value will be generic on that same type.)
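As a sketch, just that change in isolation (the tuple return type is a separate problem, discussed next):

// Sketch only: generic on the transformer type, so the concrete model type
// survives into the return value.
public (RegressionMetrics metrics, TTransformer model, IDataView scoredTestData)[]
    CrossValidate<TTransformer>(
        IDataView data, IEstimator<TTransformer> estimator, int numFolds = 5,
        string labelColumn = DefaultColumnNames.Label,
        string stratificationColumn = null, uint? seed = null)
    where TTransformer : ITransformer
{
    // ...
}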

The most serious problem is the return value. 😄 Returning an array of value-tuples is not acceptable. Consider this comment from @justinormont ...

  • I'd like to see aggregate metrics returned.

Now, yeah, sure, we won't get around to this for v1, but with this being an array, it would be impossible to add post v1. Rather, the return value should be some single object out of which you can get the per-fold results. We can always add more things (like aggregate metrics) to that object later, but if it is an array, we simply cannot.

Further, those per-fold results should not be value-tuples, for precisely the same reason: it would be impossible to add more to them later without a breaking change.

(More generally: someone should follow up and make sure we don't use value-tuples in the public surface of our API. They are currently quite problematic for F#, and the tooling in VS is poor.)
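One possible shape, purely as a sketch of the idea (class and member names here are illustrative, not a final proposal):

// Sketch only: per-fold results as a class, gathered in a single result object,
// so members (e.g. aggregate metrics) can be added later without breaking callers.
public sealed class CrossValidationFoldResult<TTransformer>
    where TTransformer : ITransformer
{
    public RegressionMetrics Metrics { get; }
    public TTransformer Model { get; }
    public IDataView ScoredTestData { get; }
}

public sealed class CrossValidationResult<TTransformer>
    where TTransformer : ITransformer
{
    public IReadOnlyList<CrossValidationFoldResult<TTransformer>> Folds { get; }
    // Aggregate metrics could be added here post-v1 without a breaking change.
}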

Note that there are multiple of these methods, usually one for each of the major training catalogs. You mentioned the regression one.

@Ivanidzo4ka

Member

Ivanidzo4ka commented Feb 11, 2019

The most serious problem is the return value. 😄

Considering that we are heavily investing in documentation right now: we currently put this code in our samples.

// Leave out 10% of data for testing
var (trainData, testData) = mlContext.BinaryClassification.TrainTestSplit(data, testFraction: 0.1);

I don't mind going through the code and cleaning this up, but first we need to agree on what we want to do.

Shall we have

public struct TrainTestSplit
{
    public readonly IDataView TrainSet;
    public readonly IDataView TestSet;
}

or do we prefer other options?
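For comparison, the sample above would read roughly like this with the struct (assuming it becomes the return type of the existing method):

// Sketch only: consuming the proposed struct instead of a value-tuple.
TrainTestSplit split = mlContext.BinaryClassification.TrainTestSplit(data, testFraction: 0.1);
IDataView trainData = split.TrainSet;
IDataView testData = split.TestSet;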

@rogancarr

Contributor Author

rogancarr commented Feb 11, 2019

@Ivanidzo4ka Those look good to me.

@rogancarr

Contributor Author

rogancarr commented Feb 11, 2019

@TomFinley What about Stratification? That parameter name is super confusing for me, as this isn't usually what people mean when they say "stratification".

@TomFinley

Contributor

TomFinley commented Feb 13, 2019

Stratification, as you point out, is a bad name. Indeed, what we do is somewhat the opposite of stratification, in a way. The suggestion of "group," though, is by itself not the best, since we already use that term to identify actual groups.

So what we're trying to do with this is identify sets of items that, if they were split across the training and test sets, would represent a form of label leakage. (Certainly, groups as we see them in ranker training represent one important case of this.) So maybe "co-dependency ID." 😄 No but seriously, not that, though I agree it should be renamed @rogancarr.
