Validation training isn't supported by the new API #2502

Closed
rogancarr opened this Issue Feb 11, 2019 · 6 comments

rogancarr commented Feb 11, 2019

I cannot find a way to train with a validation set in the new API, and I think there is some confusion as to how this would work, were it possible.

Take this example:

// Create a pipeline to train on the sentiment data
var trainData = readTrainData();
var validData = readValidData();

var pipeline = mlContext.Transforms.SomeTransform() // placeholder for any transform
    .AppendCacheCheckpoint(mlContext) as IEstimator<ITransformer>;
var preprocessor = pipeline.Fit(trainData);
var preprocessedValidData = preprocessor.Transform(validData);

// Train model with validation set.
// There is no way below to specify a validation set for the learner
pipeline = pipeline.Append(mlContext.Regression.Trainers.FastTree(numTrees: 2));
// Nor is there a way to specify a validation set in the Fit
var model = pipeline.Fit(trainData);

This example uses FastTree, but the same problem exists for GAM and FFM, our other validation-set trainers: There is nowhere to specify a validation set.

A second problem exists: if we were to specify a validation set, would we specify the raw, unprocessed validation set (here, validData) or the pre-transformed validation set (here, preprocessedValidData)?

Ivanidzo4ka commented Feb 11, 2019

var model = mlContext.Regression.Trainers.FastTree(numTrees: 2).Train(trainData, preprocessedValidData)

TomFinley commented Feb 11, 2019

Why is this Train rather than Fit? Using Fit would be more consistent and would make these overloads easier to find.

But regarding the more fundamental question: as @Ivanidzo4ka points out, your error is here:

// There is no way below to specify a validation set for the learner
pipeline = pipeline.Append(mlContext.Regression.Trainers.FastTree(numTrees: 2));
// Nor is there a way to specify a validation set in the Fit
var model = pipeline.Fit(trainData);

Pipelines are just IEstimator implementors composed of a sequence of IEstimator. The IEstimator interface is deliberately very simple and specific. If you want to use any facility specific to a particular IEstimator implementor, that's perfectly fine, but not once you add it to a pipeline. You can still compose whatever little pipe you like prior to the point where you're training, train it on the training set, apply it to the validation dataset, then do the training with validation.

The following sample may help in thinking about it. It follows the path I describe: you fit the preprocessing pipeline to get the preprocessing transformer, apply that transformer to both the training set and the validation set, and then train with validation:

public void TrainWithValidationSet()
{
    var ml = new MLContext(seed: 1, conc: 1);
    // Pipeline.
    var data = ml.Data.ReadFromTextFile<SentimentData>(GetDataPath(TestDatasets.Sentiment.trainFilename), hasHeader: true);
    var pipeline = ml.Transforms.Text.FeaturizeText("Features", "SentimentText");
    // Train the pipeline, prepare train and validation set.
    var preprocess = pipeline.Fit(data);
    var trainData = preprocess.Transform(data);
    var validDataSource = ml.Data.ReadFromTextFile<SentimentData>(GetDataPath(TestDatasets.Sentiment.testFilename), hasHeader: true);
    var validData = preprocess.Transform(validDataSource);
    // Train model with validation set.
    var trainer = ml.BinaryClassification.Trainers.FastTree("Label", "Features");
    var model = trainer.Train(ml.Data.Cache(trainData), ml.Data.Cache(validData));
}

Ivanidzo4ka commented Feb 11, 2019

Sure. All I wanted to say is that it's possible; it just can't be part of an IEstimatorChain. It has to be a separate call on the trainer itself.
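
One way to picture why the validation overload is unreachable from inside a chain is the rough sketch below. It uses hypothetical stand-in types (IEstimatorSketch, ToyModel, ToyTrainer) that are not the ML.NET types, and the toy "training" rule is invented purely for illustration. The only point is that a shared interface with a single-dataset Fit cannot expose a trainer-specific two-dataset method.

```csharp
using System;
using System.Linq;

// Hypothetical stand-ins for illustration only; these are NOT the ML.NET types.
public interface IEstimatorSketch<TModel>
{
    // The shared contract: exactly one dataset goes in.
    TModel Fit(double[] data);
}

public sealed class ToyModel
{
    public ToyModel(double prediction, bool usedValidation)
    {
        Prediction = prediction;
        UsedValidation = usedValidation;
    }

    public double Prediction { get; }
    public bool UsedValidation { get; }
}

public sealed class ToyTrainer : IEstimatorSketch<ToyModel>
{
    // Interface method: no place to pass a validation set.
    public ToyModel Fit(double[] data) =>
        new ToyModel(data.Average(), usedValidation: false);

    // Trainer-specific facility, not part of the interface.
    // Toy rule: choose the candidate closest to the validation mean.
    public ToyModel Train(double[] train, double[] validation)
    {
        double target = validation.Average();
        double[] candidates = { train.Average(), train.Max() };
        double best = candidates.OrderBy(c => Math.Abs(c - target)).First();
        return new ToyModel(best, usedValidation: true);
    }
}

public static class Program
{
    public static void Main()
    {
        var trainer = new ToyTrainer();
        double[] train = { 1, 2, 3 };
        double[] valid = { 3, 3 };

        // Seen through the shared interface (as a chain would see it),
        // only the single-dataset Fit is reachable.
        IEstimatorSketch<ToyModel> inPipeline = trainer;
        Console.WriteLine(inPipeline.Fit(train).UsedValidation);       // False

        // Called directly on the concrete trainer, validation is available.
        Console.WriteLine(trainer.Train(train, valid).UsedValidation); // True
    }
}
```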

rogancarr commented Feb 11, 2019

@Ivanidzo4ka @TomFinley Ah, I thought I was missing something — thanks for pointing it out.

New question: How do I save my pipeline out for scoring now that it's in different pieces?

  • The featurization pipeline
  • The training pipeline

And what do we do if this is just one part of a larger pipeline?

Do I then have to stitch it together by hand before I can save it? (If this is possible?)

rogancarr commented Feb 11, 2019

To answer my own question, we can do this programmatically:

// Preprocess the datasets
var preprocessor = pipeline.Fit(trainData);
var preprocessedTrainData = preprocessor.Transform(trainData);
var preprocessedValidData = preprocessor.Transform(validData);

// Train the model with a validation set
var trainedModel = mlContext.Regression.Trainers.FastTree(numTrees: 2)
    .Train(trainData: preprocessedTrainData, validationData: preprocessedValidData);

// Combine the model
var model = preprocessor.Append(trainedModel);

I think this is probably okay for a V1.0 release, but it would be nicer to have a formulation of a training pipeline as a DAG, where we can have multiple inputs (e.g. one or more training sets, validation sets, etc.).

@rogancarr rogancarr closed this Feb 11, 2019

TomFinley commented Feb 12, 2019

> I think this is probably okay for a V1.0 release, but it would be nicer to have a formulation of a training pipeline as a DAG, where we can have multiple inputs (e.g. one or more training sets, validation sets, etc.).

We already have a DAG. You have variables with values you can assign, pass those variables into methods, and so forth. The good news is, it's actually a lot more powerful than a DAG... you can have loops, write your own classes to encapsulate functionality... 😄

Seriously, the fact that you answered your own question with the conclusion being that you had to remember you were operating in C# and could therefore assign variables, call methods, and so on, is absolutely the right answer to this question. Frankly, I think it would be kind of silly for us to re-invent such basic concepts for this API. That usage of our C# .NET API for machine learning winds up looking like C# code is good, not bad.
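
As a rough illustration of "write your own classes to encapsulate functionality", the sketch below wraps the fit, transform, train-with-validation, and combine steps in one helper method. Everything in it (ITransformerSketch, Scale, Chain, FitWithValidation) is a hypothetical stand-in, not an ML.NET type, and the "preprocessor" and "trainer" are toy functions chosen so the arithmetic is easy to follow.

```csharp
using System;
using System.Linq;

// Hypothetical stand-ins for illustration only; these are NOT the ML.NET types.
public interface ITransformerSketch
{
    double[] Transform(double[] data);
}

// A fitted step that multiplies every value by a constant.
public sealed class Scale : ITransformerSketch
{
    private readonly double _factor;
    public Scale(double factor) => _factor = factor;
    public double[] Transform(double[] data) => data.Select(x => x * _factor).ToArray();
}

// Applies several fitted steps in order, like a combined model.
public sealed class Chain : ITransformerSketch
{
    private readonly ITransformerSketch[] _steps;
    public Chain(params ITransformerSketch[] steps) => _steps = steps;
    public double[] Transform(double[] data) =>
        _steps.Aggregate(data, (current, step) => step.Transform(current));
}

public static class ValidationTraining
{
    // Plain C# as the "graph": two inputs, ordinary variables, one combined output.
    public static ITransformerSketch FitWithValidation(
        Func<double[], ITransformerSketch> fitPreprocessor,
        Func<double[], double[], ITransformerSketch> trainWithValidation,
        double[] rawTrain,
        double[] rawValid)
    {
        var preprocessor = fitPreprocessor(rawTrain);          // fit on training data only
        var train = preprocessor.Transform(rawTrain);          // preprocess training set
        var valid = preprocessor.Transform(rawValid);          // preprocess validation set
        var predictor = trainWithValidation(train, valid);     // trainer sees both sets
        return new Chain(preprocessor, predictor);             // stitch the pieces together
    }
}

public static class Program
{
    public static void Main()
    {
        var model = ValidationTraining.FitWithValidation(
            fitPreprocessor: train => new Scale(1.0 / train.Max()),        // toy preprocessor
            trainWithValidation: (train, valid) => new Scale(valid.Max()), // toy trainer
            rawTrain: new[] { 1.0, 2.0, 4.0 },
            rawValid: new[] { 2.0, 8.0 });

        // 4.0 * 0.25 (preprocessor) * 2.0 (toy predictor) = 2
        Console.WriteLine(model.Transform(new[] { 4.0 })[0]); // 2
    }
}
```

The helper takes the validation set as an ordinary second argument, which is the multiple-input shape the DAG request was after, without any new pipeline abstraction.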
