IGNITE-10480: [ML] Stacking for training and inference #5635
artemmalykh wants to merge 62 commits into apache:master from
Conversation
# Conflicts:
#	modules/ml/src/main/java/org/apache/ignite/ml/dataset/PartitionDataBuilder.java
#	modules/ml/src/main/java/org/apache/ignite/ml/dataset/UpstreamTransformerChain.java
#	modules/ml/src/main/java/org/apache/ignite/ml/dataset/impl/local/LocalDatasetBuilder.java
# Conflicts:
#	modules/ml/src/main/java/org/apache/ignite/ml/clustering/kmeans/KMeansTrainer.java
#	modules/ml/src/main/java/org/apache/ignite/ml/dataset/impl/cache/util/ComputeUtils.java
#	modules/ml/src/main/java/org/apache/ignite/ml/knn/regression/KNNRegressionTrainer.java
#	modules/ml/src/main/java/org/apache/ignite/ml/math/functions/IgniteFunction.java
#	modules/ml/src/main/java/org/apache/ignite/ml/math/primitives/vector/VectorUtils.java
#	modules/ml/src/main/java/org/apache/ignite/ml/regressions/logistic/multiclass/LogRegressionMultiClassTrainer.java
#	modules/ml/src/main/java/org/apache/ignite/ml/svm/SVMLinearMultiClassClassificationTrainer.java
#	modules/ml/src/main/java/org/apache/ignite/ml/trainers/DatasetTrainer.java
#	modules/ml/src/test/java/org/apache/ignite/ml/TestUtils.java
#	modules/ml/src/test/java/org/apache/ignite/ml/trainers/BaggingTest.java
```java
 * @param <X> Type of input and output of identity model.
 * @return Model equivalent to identity function.
 */
public static <X> Model<X, X> identityModel() {
```
"No usages found in All Places"
Okay, let's add it on demand.
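For context, the identity model discussed above can be sketched in plain Java; the `Model` interface below is a minimal stand-in for Ignite's, not the real API:

```java
public class IdentityModelSketch {
    // Minimal stand-in for Ignite's Model interface (assumption, not the real API).
    interface Model<I, O> {
        O apply(I input);
    }

    // Model equivalent to the identity function: returns its input unchanged.
    static <X> Model<X, X> identityModel() {
        return x -> x;
    }

    public static void main(String[] args) {
        Model<String, String> id = identityModel();
        System.out.println(id.apply("features")); // prints "features"
    }
}
```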
```java
 */
void addSubmodel(Model<IS, IA> subMdl) {
    submodels.add(subMdl);
    subModelsLayer = subModelsLayer != null ? subModelsLayer.combine(subMdl, aggregatingInputMerger)
```
In my opinion such an interface is not convenient. The combiner requires working with monoids like List[Double], but users may expect working with Doubles.
Yeah, I've got StackedVectorDatasetTrainer#addTrainerWithDoubleOutput for this, which essentially lifts Double to the Vector monoid (with concatenation as mappend).
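The lifting described here can be sketched in plain Java: a `Double` output is wrapped into a 1-element vector, and vectors are combined by concatenation (the monoid's mappend). The names `num2Vec` and `concat` are illustrative stand-ins, not the Ignite API:

```java
import java.util.Arrays;

public class DoubleToVectorLift {
    // Wrap a scalar submodel output into a 1-element vector.
    static double[] num2Vec(double val) {
        return new double[] {val};
    }

    // Monoid append for the vector monoid: concatenation.
    static double[] concat(double[] a, double[] b) {
        double[] res = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, res, a.length, b.length);
        return res;
    }

    public static void main(String[] args) {
        // Two Double-valued submodel outputs merged into one feature vector.
        double[] merged = concat(num2Vec(0.5), num2Vec(1.5));
        System.out.println(Arrays.toString(merged)); // [0.5, 1.5]
    }
}
```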
```java
import org.apache.ignite.ml.math.primitives.vector.VectorUtils;
import org.apache.ignite.ml.trainers.DatasetTrainer;

public class StackedVectorModel<O, AM extends Model<Vector, O>, L> extends SimpleStackedModelTrainer<Vector, O, AM, L> {
```
Seems like this class is unnecessary; removed it completely.
```java
return new IgniteFunction<T, R>() {
    /** {@inheritDoc} */
    @Override public String toString() {
        return "Constant function [c=" + r + "]";
```
it looks good but we don't use toString for functions)
Yes, the intention was to make debugging less painful by having at least some of the lambdas listed with meaningful names, but maybe we should make a ticket for that.
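The debugging idea can be sketched without Ignite: wrap a lambda in an anonymous class that overrides `toString` so debuggers and logs show a meaningful name instead of a synthetic lambda class name. A minimal sketch with hypothetical helper names:

```java
import java.util.function.Function;

public class NamedFunctions {
    // Wraps a function so that toString() yields a meaningful name.
    // "named" is an illustrative helper, not part of the Ignite API.
    static <T, R> Function<T, R> named(String name, Function<T, R> f) {
        return new Function<T, R>() {
            @Override public R apply(T t) { return f.apply(t); }
            @Override public String toString() { return name; }
        };
    }

    // Constant function with a self-describing toString, as in the PR hunk.
    static <T, R> Function<T, R> constant(R r) {
        return named("Constant function [c=" + r + "]", t -> r);
    }

    public static void main(String[] args) {
        System.out.println(constant(42)); // Constant function [c=42]
    }
}
```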
```java
 * @param <M1> Type of submodel trainer.
 * @return This object.
 */
public <M1 extends Model<Vector, Double>> StackedVectorTrainer<O, AM, L> withAddedDoubleValuedTrainer(
```
-> addModelTrainerWithDoubleOutput
```java
 */
public <M1 extends Model<Vector, Double>> StackedVectorTrainer<O, AM, L> withAddedDoubleValuedTrainer(
    DatasetTrainer<M1, L> trainer) {
    return withAddedTrainer(AdaptableDatasetTrainer.of(trainer).afterTrainedModel(VectorUtils::num2Vec));
}

/**
 * Get a composition model of the form {@code after . mdl}.
```
you provide the composition interface through the andThen function:
"f1 andThen f2" works as "f2(f1(x))", but "f1 compose f2" as "f1(f2(x))".
In my opinion you should use uniform semantics (preferably "andThen") in all comments to avoid problems with understanding and using such an interface.
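The semantic difference the reviewer points out is easy to demonstrate with `java.util.function.Function`:

```java
import java.util.function.Function;

public class CompositionSemantics {
    public static void main(String[] args) {
        Function<Integer, Integer> f1 = x -> x + 1;
        Function<Integer, Integer> f2 = x -> x * 10;

        // "f1 andThen f2" applies f1 first: f2(f1(x)).
        int andThen = f1.andThen(f2).apply(3);  // (3 + 1) * 10 = 40
        // "f1 compose f2" applies f2 first: f1(f2(x)).
        int compose = f1.compose(f2).apply(3);  // (3 * 10) + 1 = 31

        System.out.println(andThen + " " + compose); // 40 31
    }
}
```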
```java
        submodelInput2AggregatingInputConverter);
}

@Override public SimpleStackedDatasetTrainer<I, O, AM, L> withAggregatorInputMerger(IgniteBinaryOperator<I> merger) {
    return (SimpleStackedDatasetTrainer<I, O, AM, L>)super.withAggregatorInputMerger(merger);
}

@Override public SimpleStackedDatasetTrainer<I, O, AM, L> withOriginalFeaturesKept(
```
```java
 * </pre>
 * During second step we can choose if we want to keep original features along with converted outputs of first layer
 * models or use only converted results of first layer models. This choice will also affect inference.
 * This class is a most abstract stacked trainer, there is a {@link StackedVectorDatasetTrainer}: a shortcut version of
```
I think "the most general stacked trainer" will be better
```java
 * @param trainer Submodel trainer.
 * @return This object.
 */
public <M1 extends Model<IS, IA>> StackedDatasetTrainer<IS, IA, O, AM, L> addTrainer(
```
it is just a wrapper without additional logic;
what is the reason for using it?
The reason is to convert everything into DatasetTrainer<Model<IS, IA>, L> (in contrast to DatasetTrainer<? extends Model<IS, IA>, L>) to make working with the list of submodel trainers less painful. This is an unsafe conversion, but since we control all usages of the submodel trainers list inside our class, it's IMHO a reasonable tradeoff.
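The unsafe-but-contained conversion described here can be sketched with simplified stand-in types (`Model` and `Trainer` below are assumptions, not Ignite's real interfaces):

```java
import java.util.ArrayList;
import java.util.List;

public class WildcardNarrowing {
    // Minimal stand-ins for Ignite's Model and DatasetTrainer (assumptions).
    interface Model<I, O> { O apply(I in); }
    interface Trainer<M extends Model<double[], Double>> { M fit(); }

    // The wrapper: widens Trainer<? extends Model<...>> to Trainer<Model<...>>
    // so submodel trainers can live in one homogeneous list. The cast is
    // unchecked, but safe as long as the list is only used internally.
    @SuppressWarnings("unchecked")
    static Trainer<Model<double[], Double>> widen(Trainer<? extends Model<double[], Double>> t) {
        return (Trainer<Model<double[], Double>>)t;
    }

    public static void main(String[] args) {
        List<Trainer<Model<double[], Double>>> trainers = new ArrayList<>();
        trainers.add(widen(() -> v -> v[0]));        // toy "first feature" model
        trainers.add(widen(() -> v -> v[0] * 2.0));  // toy "doubled" model

        double sum = 0;
        for (Trainer<Model<double[], Double>> t : trainers)
            sum += t.fit().apply(new double[] {3.0});
        System.out.println(sum); // 9.0
    }
}
```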
```java
}

/** {@inheritDoc} */
@Override protected <K, V> StackedModel<IS, IA, O, AM> updateModel(StackedModel<IS, IA, O, AM> mdl,
```
maybe we should rewrite the updateModel function from its abstract form to something like this, to avoid such code:

```java
class DatasetTrainer {
    ...
    protected ... updateModel(...) {
        throw new NotImplementedException();
    }
    ...
}
```

?
Hmm... We would avoid boilerplate, but it seems like it would make the code more error prone. A developer can forget to override this method for a trainer which potentially supports updating, and get an error only when trying to update the model later, whereas keeping it abstract forces the developer to think about whether the trainer supports updating and to insert NotImplementedException more cautiously.
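The two designs under discussion can be contrasted with a minimal sketch (types simplified to `String` for illustration; neither class is the real `DatasetTrainer`):

```java
public class UpdateModelTradeoff {
    // Variant A (current design): the abstract method forces every trainer
    // author to decide explicitly whether update is supported.
    static abstract class AbstractTrainer {
        protected abstract String updateModel(String mdl);
    }

    // Variant B (proposed): a default implementation that throws saves
    // boilerplate, but a forgotten override only fails at runtime.
    static class DefaultThrowTrainer {
        protected String updateModel(String mdl) {
            throw new UnsupportedOperationException("Update is not supported");
        }
    }

    public static void main(String[] args) {
        AbstractTrainer a = new AbstractTrainer() {
            @Override protected String updateModel(String mdl) {
                return mdl + "+delta";
            }
        };
        System.out.println(a.updateModel("m")); // m+delta

        try {
            new DefaultThrowTrainer().updateModel("m");
        }
        catch (UnsupportedOperationException e) {
            // The missing override is only discovered here, at runtime.
            System.out.println("update failed at runtime: " + e.getMessage());
        }
    }
}
```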
```java
ensemble -> {
    int i = 0;
    List<IgniteSupplier<Model<IS, IA>>> res = new ArrayList<>();
    for (Model<IS, IA> submodel : mdl.submodels()) {
```
Could you merge the update and fit logic?
```java
/**
 * Type used to adapt input and output types of wrapped {@link DatasetTrainer}.
 * Produces model which is composition of form {@code after . wMdl . before} where dot denotes functional composition
```
please avoid functional composition notation in the Java world)
in this wild world there are a lot of non-functional programmers who may beat you))
Ok, since we have andThen, I'll use it.
P.S. I will camouflage in a State monad to look imperativish
```java
IgniteFunction<IS, IA> submodelInput2AggregatingInputConverter,
IgniteFunction<IA, Vector> submodelOutput2VectorConverter,
IgniteFunction<Vector, IS> vector2SubmodelInputConverter) {
    if (submodelInput2AggregatingInputConverter != null)
```
The reasons behind "2" are:
- I saw some methods with such naming in the code, and decided to be consistent.
- "2" is a good visual separator (at least for me) between the "from" and "to" parts, which (again, at least for me) raises readability.
```java
IgniteFunction<IA, Vector> submodelOutput2VectorConverter,
IgniteFunction<Vector, IS> vector2SubmodelInputConverter,
Vector v) {
    return vector2SubmodelInputConverter.andThen(mdl).andThen(submodelOutput2VectorConverter).apply(v);
```
```java
 * Constructs instance of this class.
 */
public StackedVectorDatasetTrainer() {
    this(null);
```
This is a shortcut specially for people who prefer to pass all arguments via withs.
And the one with aggregator trainer as a single param is for IDE automatic type inference when introducing new variable.
```java
 * @param <M1> Type of submodel trainer model.
 * @return This object.
 */
public <M1 extends Model<Matrix, Matrix>> StackedVectorDatasetTrainer<O, AM, L> addMatrix2MatrixTrainer(
```
This is shortcut "specially" for MLP :)
```java
 * @param val Value to wrap.
 * @return Specified value wrapped into vector.
 */
public static Vector num2Vec(double val) {
```
If you don't mind, I wouldn't change it now because otherwise for consistency I should also change other such methods. I think it should be done atomically after some discussion.
```java
 * @param val Value to wrap in array.
 * @return Number wrapped in 1-sized array.
 */
public static double[] num2Arr(double val) {
```
```java
 * Creates {@link DatasetTrainer} with same training logic, but able to accept labels of given new type
 * of labels.
 *
 * @param new2Old Converter of new labels to old labels.
```
new2Old? newToOld or something better
```java
StackedModel<Vector, Vector, Double, LinearRegressionModel> mdl = trainer
    .withAggregatorTrainer(new LinearRegressionLSQRTrainer().withConvertedLabels(x -> x * factor))
    .addTrainer(mlpTrainer)
```
Getting the StackedModel looks pretty, many thanks!
Thanks! As for withTrainer, it was a tough choice: withTrainer looks like we substitute a new trainer in place of the old one, not add it. Previously it was withAddedTrainer, which looked a bit clumsy, so I went with addTrainer.
```java
final double factor = 3;

StackedModel<Vector, Vector, Double, LinearRegressionModel> mdl = trainer
    .withAggregatorTrainer(new LinearRegressionLSQRTrainer().withConvertedLabels(x -> x * factor))
```
ConvertedLabels sometimes seems like a reinvention of labelExtractor
Maybe we need a preprocessor like LabelTransformer with a map function to change them
This is for the case when we use one DatasetBuilder for many trainers, like we do in StackedDatasetTrainer. In this case we want the ability to adapt each of the trainers to work with this dataset; this method is just for that.
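The per-trainer label adaptation described here can be sketched with stand-in types (`Trainer` and `withConvertedLabels` below are assumptions mirroring the PR's idea, not Ignite's real `DatasetTrainer` API):

```java
import java.util.function.Function;

public class ConvertedLabelsSketch {
    // Minimal stand-in for a trainer over labels of type L (assumption).
    interface Trainer<L> {
        double fit(double[] features, L label);
    }

    // Same training logic, but accepting labels of a new type via a
    // new -> old converter, in the spirit of withConvertedLabels.
    static <L1, L2> Trainer<L1> withConvertedLabels(Trainer<L2> trainer, Function<L1, L2> new2Old) {
        return (features, lbl) -> trainer.fit(features, new2Old.apply(lbl));
    }

    public static void main(String[] args) {
        Trainer<Double> doubleTrainer = (f, lbl) -> f[0] + lbl;   // toy trainer
        // Adapt the Double-labeled trainer to String labels, so one
        // dataset with String labels can feed several adapted trainers.
        Trainer<String> strTrainer = withConvertedLabels(doubleTrainer, Double::parseDouble);
        System.out.println(strTrainer.fit(new double[] {1.0}, "2.5")); // 3.5
    }
}
```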
No description provided.