Validation training isn't supported by the new API #2502
I cannot find a way to train with a validation set in the new API, and I think there is some confusion as to how this would work, were it possible.
Take this example:
```csharp
// Create a pipeline to train on the sentiment data
var trainData = readTrainData();
var validData = readValidData();
var pipeline = mlContext.Transforms.SomeTransform()
    .AppendCacheCheckpoint(mlContext) as IEstimator<ITransformer>;

var preprocessor = pipeline.Fit(trainData);
var preprocessedValidData = preprocessor.Transform(validData);

// Train model with validation set.
// There is no way below to specify a validation set for the learner
pipeline = pipeline.Append(mlContext.Regression.Trainers.FastTree(numTrees: 2));

// Nor is there a way to specify a validation set in the Fit
var model = pipeline.Fit(trainData);
```
This example uses the new estimator-based API.
A second problem exists, namely: if we were to specify a validation set, would we specify the raw, unprocessed validation set (here `validData`) or the preprocessed one (here `preprocessedValidData`)?
But regarding the more fundamental question: as @Ivanidzo4ka points out, your error is here:
```csharp
// There is no way below to specify a validation set for the learner
pipeline = pipeline.Append(mlContext.Regression.Trainers.FastTree(numTrees: 2));

// Nor is there a way to specify a validation set in the Fit
var model = pipeline.Fit(trainData);
```
Pipelines are just chains of estimators; you can fit them in pieces and compose the resulting transformers yourself.
See the following sample; it might guide how to think about it. You'll note that it follows the path I describe: you do an initial fit on the training set to get the preprocessing transformer, then apply that transformer to both the training and validation sets, and so on:
New question: How do I save my pipeline out for scoring now that it's in different pieces?
And what do we do if this is just one part of a larger pipeline?
Do I then have to stitch it together by hand before I can save it? (If this is possible?)
To answer my own question, we can do this programmatically:
```csharp
// Preprocess the datasets
var preprocessor = pipeline.Fit(trainData);
var preprocessedTrainData = preprocessor.Transform(trainData);
var preprocessedValidData = preprocessor.Transform(validData);

// Train the model with a validation set
var trainedModel = mlContext.Regression.Trainers.FastTree(numTrees: 2)
    .Train(trainData: preprocessedTrainData, validationData: preprocessedValidData);

// Combine the model
var model = preprocessor.Append(trainedModel);
```
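As for saving the stitched-together model: since `model` above is an ordinary transformer chain, it can be persisted as one unit. Here's a rough sketch, assuming the variables from the snippet above; note the exact `Save`/`Load` overloads vary across ML.NET versions (newer releases also take the input schema, e.g. `mlContext.Model.Save(model, trainData.Schema, "model.zip")`), so check the version you're on:

```csharp
using System.IO;

// Persist the combined preprocessor + trained model as a single .zip
using (var stream = File.Create("model.zip"))
    mlContext.Model.Save(model, stream);

// Later, load it back for scoring
ITransformer loadedModel;
using (var stream = File.OpenRead("model.zip"))
    loadedModel = mlContext.Model.Load(stream);
```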
I think this is probably okay for a V1.0 release, but it would be nicer to have a formulation of a training pipeline as a DAG, where we can have multiple inputs (e.g. one or more training sets, validation sets, etc.).
We already have a DAG. You have variables with values you can assign, you pass those variables into methods, and so forth. The good news is, it's actually a lot more powerful than a DAG: you can have loops, write your own classes to encapsulate functionality, and so on.
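To make that concrete, here is a hypothetical sketch of what "loops" buys you: because the pipeline is plain C#, you can sweep a hyperparameter and keep whichever model scores best on the validation set. It reuses the identifiers from the snippets above (`mlContext`, `preprocessedTrainData`, `preprocessedValidData`); treat it as an illustration of the pattern, not a definitive API reference:

```csharp
// Try several tree counts and keep the best model by validation R^2
ITransformer bestModel = null;
double bestRSquared = double.NegativeInfinity;

foreach (var trees in new[] { 2, 10, 50 })
{
    var candidate = mlContext.Regression.Trainers.FastTree(numTrees: trees)
        .Fit(preprocessedTrainData);

    var metrics = mlContext.Regression.Evaluate(
        candidate.Transform(preprocessedValidData));

    if (metrics.RSquared > bestRSquared)
    {
        bestRSquared = metrics.RSquared;
        bestModel = candidate;
    }
}
```

No graph formalism needed: the "multiple inputs" of the DAG are just method arguments.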
Seriously, the fact that you answered your own question by remembering you were operating in C#, and could therefore assign variables, call methods, and so on, is absolutely the right answer here. Frankly, I think it would be kind of silly for us to re-invent such basic concepts for this API. That usage of our C# .NET API for machine learning winds up looking like C# code is good, not bad.