Three major concepts: Estimators, Transformers and Data #581

Zruty0 · 2018-07-25T02:39:15Z

This is still an incomplete proposal, but I played for a bit with what I had, and it looks promising to me so far.

The general idea is that we narrow our 'zoo' of components (transforms, predictors, scorers, loaders etc) down to three kinds:

The data. An IDataView with schema, like before.
The transformer. This is an object that can transform data and output data.

    public interface IDataTransformer
    {
        IDataView Transform(IDataView input);
        ISchema GetOutputSchema(ISchema inputSchema);
    }

The estimator. This is the 'trainer'. The object that can 'train' a transformer using data.

    public interface IDataEstimator
    {
        IDataTransformer Fit(IDataView input);
        SchemaShape GetOutputSchema(SchemaShape inputSchema);
    }

Obviously, a chain of transformers can itself behave as a transformer, and a chain of estimators can behave like an estimator.
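To make the chaining claim concrete, here is a minimal sketch. It is not the playground code: `ISimpleTransformer`, `ISimpleEstimator`, `TransformerChain`, `EstimatorChain` and the two toy estimators are hypothetical names, and "data" is modeled as a plain sequence of doubles rather than an IDataView. The assumption is that each estimator in a chain is fitted on the output of the steps fitted before it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for IDataTransformer; data is just a sequence of doubles.
public interface ISimpleTransformer
{
    IEnumerable<double> Transform(IEnumerable<double> input);
}

// Hypothetical stand-in for IDataEstimator.
public interface ISimpleEstimator
{
    ISimpleTransformer Fit(IEnumerable<double> input);
}

// A chain of transformers is itself a transformer: it applies each step in order.
public sealed class TransformerChain : ISimpleTransformer
{
    private readonly IReadOnlyList<ISimpleTransformer> _steps;
    public TransformerChain(IEnumerable<ISimpleTransformer> steps) => _steps = steps.ToList();

    public IEnumerable<double> Transform(IEnumerable<double> input) =>
        _steps.Aggregate(input, (data, step) => step.Transform(data));
}

// A chain of estimators is itself an estimator: each step is fitted on the
// data produced by the steps that were fitted before it.
public sealed class EstimatorChain : ISimpleEstimator
{
    private readonly IReadOnlyList<ISimpleEstimator> _steps;
    public EstimatorChain(params ISimpleEstimator[] steps) => _steps = steps;

    public ISimpleTransformer Fit(IEnumerable<double> input)
    {
        var fitted = new List<ISimpleTransformer>();
        var current = input.ToList();
        foreach (var estimator in _steps)
        {
            var transformer = estimator.Fit(current);
            fitted.Add(transformer);
            current = transformer.Transform(current).ToList();
        }
        return new TransformerChain(fitted);
    }
}

// Toy estimator: learns the mean and yields a transformer that subtracts it.
public sealed class MeanCenterEstimator : ISimpleEstimator
{
    public ISimpleTransformer Fit(IEnumerable<double> input) => new Shift(-input.Average());
}

// Toy estimator: learns the largest absolute value and yields a transformer that divides by it.
public sealed class MaxAbsScaleEstimator : ISimpleEstimator
{
    public ISimpleTransformer Fit(IEnumerable<double> input)
    {
        var max = input.Max(x => Math.Abs(x));
        return new Scale(max == 0 ? 1 : 1 / max);
    }
}

public sealed class Shift : ISimpleTransformer
{
    private readonly double _offset;
    public Shift(double offset) => _offset = offset;
    public IEnumerable<double> Transform(IEnumerable<double> input) => input.Select(x => x + _offset);
}

public sealed class Scale : ISimpleTransformer
{
    private readonly double _factor;
    public Scale(double factor) => _factor = factor;
    public IEnumerable<double> Transform(IEnumerable<double> input) => input.Select(x => x * _factor);
}
```

Calling `new EstimatorChain(new MeanCenterEstimator(), new MaxAbsScaleEstimator()).Fit(data)` returns a single `ISimpleTransformer`, so callers never need to know how many steps sit behind it.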

We also introduce a 'data reader' (and its estimator), responsible for bringing the data 'from outside' (think loaders):

    public interface IDataReader<TIn>
    {
        IDataView Read(TIn input);
        ISchema GetOutputSchema();
    }

    public interface IDataReaderEstimator<TIn>
    {
        IDataReader<TIn> Fit(TIn input);
        SchemaShape GetOutputSchema();
    }

| Old component | New component |
| --- | --- |
| Data | Data |
| Transform | Transformer |
| Trainable transform (before it is trained) | Estimator |
| Trainable transform (after it is trained) | Transformer |
| Trainer | Estimator |
| Predictor | not sure yet. I'm thinking like 'a field of the scoring transformer?' |
| Scorer | Transformer |
| Untrainable loader | Data reader |
| Trainable loader | Estimator of data reader |

I have gone through the motions of creating 'pipeline estimator' and 'pipeline transformer' objects, which then allow me to write this code to train and test:

            var env = new TlcEnvironment();

            var pipeline = new EstimatorPipe<IMultiStreamSource>(new MyTextLoader(env, MakeTextLoaderArgs()));
            pipeline.Append(new MyConcatTransformer(env, "Features", "SepalLength", "SepalWidth", "PetalLength", "PetalWidth"))
                    .Append(new MyNormalizer(env, "Features"))
                    .Append(new MySdca(env));

            var model = pipeline.Fit(new MultiFileSource(@"e:\data\iris.txt"));

            IrisPrediction[] scoredTrainData = model.Transform(new MultiFileSource(@"e:\data\iris.txt"))
                .AsEnumerable<IrisPrediction>(env, reuseRowObject: false)
                .ToArray();

Here, the only catch is the 'MakeTextLoaderArgs', which is an obnoxiously long way to define the original schema of the text loader. But it is obviously subject to improvement.

The full 'playground' is available at https://github.com/Zruty0/machinelearning/tree/feature/estimators


Zruty0 · 2018-07-25T03:05:23Z

And this is approximately how we can dissect the pipeline (before or after training) and recompose it.

Here, I'm going to strip out a loader and make it into a prediction engine.

            ITransformer<IMultiStreamSource> loader;
            IEnumerable<IDataTransformer> steps;
            (loader, steps) = model.GetParts();

            var engine = new MyPredictionEngine<IrisData, IrisPrediction>(env, loader.GetOutputSchema(), steps);
            IrisPrediction prediction = engine.Predict(new IrisData()
            {
                SepalLength = 5.1f,
                SepalWidth = 3.3f,
                PetalLength = 1.6f,
                PetalWidth = 0.2f,
            });

And this is how I can take out a normalizer, because I'm crazy:

            var bogusEngine = new MyPredictionEngine<IrisData, IrisPrediction>(env, loader.GetOutputSchema(), new[] { steps.First(), steps.Last() });
            IrisPrediction bogusPrediction = bogusEngine.Predict(new IrisData()
            {
                SepalLength = 5.1f,
                SepalWidth = 3.3f,
                PetalLength = 1.6f,
                PetalWidth = 0.2f,
            });

TomFinley · 2018-07-26T11:58:27Z

Hi @Zruty0 , this seems positive. The "estimator" logic is what we consider the "ideal" solution to #267. It also could be a declarative structure. I would like input from @interesaaat and @tcondie, if they can be persuaded to provide notes. Also separating out the conflation between model and data would avoid #580.

Strong typing seems like a problem in the current proposal. To take an example: a linear trainer produces a linear predictor (ITrainer<LinearPredictor> in the current system). So, if I do var pred = new Sdca(...).Train(...) I can, in a discoverable way through autocomplete, see that this pred has properties for the weights and bias. The same goes for getting the sequence of term mappings out of an instantiated TermTransform, the topics out of LDA, or anything like this. However, this proposal has things that are merely estimators that return transforms of no particular type -- opaque black boxes, which is a serious problem people have with the current LearningPipeline API. This would be blocking, but fortunately I'm fairly certain you can resolve this issue with a covariant generic type on the output. (Similar to the current arrangement between ITrainer and ITrainer<out TPredictor>, no doubt.)
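One way the covariant output type could look, as a hedged sketch rather than the actual design: `IPredictorSketch`, `IEstimatorSketch<out TTransformer>`, `LinearModel` and `LeastSquaresEstimator` are hypothetical names, and the "training" is a one-dimensional least-squares fit chosen only to keep the example small.

```csharp
using System;
using System.Linq;

// Hypothetical stand-in for the opaque transformer type.
public interface IPredictorSketch
{
    double Predict(double x);
}

// 'out' makes TTransformer covariant: an IEstimatorSketch<LinearModel> can be
// used wherever an IEstimatorSketch<IPredictorSketch> is expected.
public interface IEstimatorSketch<out TTransformer> where TTransformer : IPredictorSketch
{
    TTransformer Fit(double[] xs, double[] ys);
}

// The concrete result type exposes its learned parameters, so they are
// discoverable through autocomplete instead of hidden behind a black box.
public sealed class LinearModel : IPredictorSketch
{
    public double Weight { get; }
    public double Bias { get; }

    public LinearModel(double weight, double bias)
    {
        Weight = weight;
        Bias = bias;
    }

    public double Predict(double x) => Weight * x + Bias;
}

// Toy trainer: ordinary least squares for a single feature.
public sealed class LeastSquaresEstimator : IEstimatorSketch<LinearModel>
{
    public LinearModel Fit(double[] xs, double[] ys)
    {
        if (xs.Length == 0 || xs.Length != ys.Length)
            throw new ArgumentException("xs and ys must be non-empty and of equal length.");

        double meanX = xs.Average();
        double meanY = ys.Average();
        double covariance = xs.Zip(ys, (x, y) => (x - meanX) * (y - meanY)).Sum();
        double variance = xs.Sum(x => (x - meanX) * (x - meanX));

        // With a constant feature the slope is undefined; fall back to predicting the mean.
        double weight = variance == 0 ? 0 : covariance / variance;
        return new LinearModel(weight, meanY - weight * meanX);
    }
}
```

With this shape, `new LeastSquaresEstimator().Fit(xs, ys)` is statically a `LinearModel`, while generic pipeline code can still hold the same object as `IEstimatorSketch<IPredictorSketch>` and see only `Predict`.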

alexdegroot · 2018-07-27T08:12:58Z

Regarding MakeTextLoaderArgs:
The options pattern is one option (https://docs.microsoft.com/en-us/aspnet/core/fundamentals/configuration/options?view=aspnetcore-2.1), but you could also consider simply adding an optional 'Action argsAction' argument to the constructor.
Then you can configure it pretty elegantly:

    new MyTextLoader(env, args => { args.ArgOne = 1; args.ArgTwo = "Yada"; });

Or leave it out:

    new MyTextLoader(env);

Inside the constructor, you can do something like the following, which ensures that you always have a correctly initialized object to deal with for further processing:

    // Start from defaults, then let the caller's delegate (if any) mutate them.
    var args = new MakeTextLoaderArgs();
    argsAction?.Invoke(args);
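A self-contained sketch of the args-mutating delegate idea, under stated assumptions: `TextLoaderArgsSketch` and `TextLoaderSketch` are hypothetical types invented for illustration, and the `Separator` and `HasHeader` settings are placeholders rather than the real loader arguments.

```csharp
using System;

// Hypothetical settings object with sensible defaults.
public sealed class TextLoaderArgsSketch
{
    public char Separator { get; set; } = '\t';
    public bool HasHeader { get; set; }
}

// Hypothetical loader whose constructor takes an optional configuration delegate.
public sealed class TextLoaderSketch
{
    public TextLoaderArgsSketch Args { get; }

    public TextLoaderSketch(Action<TextLoaderArgsSketch> configure = null)
    {
        // Always start from a fully initialized default object...
        var args = new TextLoaderArgsSketch();
        // ...and apply the caller's changes only if a delegate was supplied.
        configure?.Invoke(args);
        Args = args;
    }
}
```

Callers can then write `new TextLoaderSketch()` for defaults or `new TextLoaderSketch(a => { a.Separator = ','; a.HasHeader = true; })` to override them, without ever constructing the settings object themselves.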

TomFinley · 2018-07-28T06:19:26Z

Oh that's an interesting idea @alexdegroot ... that way you don't have to expose the details of constructing this little object at all. Hmmm. Something about that is very appealing.

Zruty0 · 2018-07-30T16:27:22Z

Yes @alexdegroot , this sounds like a great idea to me.

I am a little bit suspicious of introducing yet another level of indirection into the API (data -> ITransformer that produces data -> IEstimator that produces ITransformer -> Options that are needed to construct IEstimator), but the args-mutating delegate seems easy and powerful.

alexdegroot · 2018-07-30T16:49:38Z

There's also the option to support both: either inject the object as an argument or use the args-mutating delegate.
I'm not sure how strict you guys are about having a single way of doing things.
There's a risk that you end up in the modularity disaster where ASP.NET Core 2 already is: one way works and is preferred in 1.0, another in 2.0 and a third is required to make things work in 2.1.

When it comes to consistency, I'd opt for a single way to produce Args across all these objects. If you want to do things fluently, then you should basically never have to leave your stack of calls. As a bonus, you can simply comment out a few lines.

eerhardt · 2018-08-07T02:56:11Z

This might just be a typo, but what is the difference between ISchema and SchemaShape? I see that estimators operate over SchemaShape and transformers operate on ISchema. Is there a difference? Or are these supposed to be the same thing?

justinormont · 2018-08-07T10:17:00Z

Will the difference between estimators and transformers cause users to have to know when a transform is trainable?

For example the TextTransform is trainable if using the dictionary method, but not when using hashing. I'm unsure how NAHandleTransform is coded, but simply replacing the default-value for the datatype doesn't need a trainable transform whereas replacing w/ the mean-value would.

Zruty0 · 2018-08-07T15:32:53Z

@eerhardt , it's not a typo. SchemaShape is a 'relaxed schema', holding some properties of the schema, but not all:

| Concept | ISchema | SchemaShape |
| --- | --- | --- |
| Column names | Yes | Yes |
| Column type | Exact | Exact type for scalars, also 'Vector' and 'VariableVector', but no vector size |
| Column index | Yes | No |
| Hidden columns | Yes | No |
| Metadata | Full | Names of metadata that is present, but no values |
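The relationship in the table can be sketched as a projection from a full schema to its relaxed shape. This is only an illustration under assumptions: `FullColumn`, `ShapeColumn`, `ColumnKind` and `SchemaShapeSketch.FromSchema` are hypothetical, the item type is a plain string, and a vector size of 0 is assumed to mean "variable length".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// All types here are hypothetical stand-ins, not the real ISchema or SchemaShape.
public enum ColumnKind { Scalar, Vector, VariableVector }

// A column as a full schema would describe it.
public sealed class FullColumn
{
    public string Name { get; set; }
    public int Index { get; set; }
    public string ItemType { get; set; }   // e.g. "R4" or "Text"
    public int? VectorSize { get; set; }   // null = scalar, 0 = variable-length vector (assumption)
    public bool IsHidden { get; set; }
    public Dictionary<string, object> Metadata { get; } = new Dictionary<string, object>();
}

// The relaxed view: no index, no vector size, no metadata values.
public sealed class ShapeColumn
{
    public string Name { get; }
    public ColumnKind Kind { get; }
    public string ItemType { get; }
    public IReadOnlyList<string> MetadataNames { get; }

    public ShapeColumn(string name, ColumnKind kind, string itemType, IReadOnlyList<string> metadataNames)
    {
        Name = name;
        Kind = kind;
        ItemType = itemType;
        MetadataNames = metadataNames;
    }
}

public static class SchemaShapeSketch
{
    // Projects a full schema onto its shape: hidden columns are dropped, and
    // only the vector kind and the metadata names survive.
    public static IReadOnlyList<ShapeColumn> FromSchema(IEnumerable<FullColumn> schema) =>
        schema.Where(c => !c.IsHidden)
              .Select(c => new ShapeColumn(
                  c.Name,
                  KindOf(c.VectorSize),
                  c.ItemType,
                  c.Metadata.Keys.OrderBy(k => k, StringComparer.Ordinal).ToList()))
              .ToList();

    private static ColumnKind KindOf(int? vectorSize) =>
        vectorSize == null ? ColumnKind.Scalar
        : vectorSize == 0 ? ColumnKind.VariableVector
        : ColumnKind.Vector;
}
```

The projection only goes one way: a shape can always be computed from a schema, but a schema cannot be rebuilt from a shape, which is why an estimator can promise a shape before it has seen any data.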

Zruty0 · 2018-08-07T15:38:44Z

@justinormont , generally speaking, yes.
For non-trainable transforms, they will map directly to Transformer objects, so there will be public constructors that create these transformers.

For both trainable and non-trainable transforms, there will also be corresponding Estimators that fit the corresponding transformers.

For example, if you try to instantiate a DictionaryTextTransformer, you will need to provide a dictionary of terms at construction time. If you don't have it, it's an indication that you need to instantiate a DictionaryTextTransformEstimator (naming is not finalized), that will learn the dictionary.

For HashTextTransformer, there will be a constructor that only accepts number of hash bits. The corresponding HashTextTransformEstimator will also have an 'invert hash' option, to associate hash buckets with encountered values.
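The split described above can be sketched as follows. The class names are hypothetical and do not match the proposed `DictionaryTextTransformer` or `HashTextTransformer` APIs; tokens are plain strings, unknown terms map to -1, and the hash is a simple FNV-1a chosen only so the example is deterministic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Needs a dictionary at construction time, so it cannot exist without one.
public sealed class DictionaryTermTransformerSketch
{
    private readonly Dictionary<string, int> _termToIndex = new Dictionary<string, int>();

    public DictionaryTermTransformerSketch(IEnumerable<string> terms)
    {
        // Terms are numbered in order of first appearance; duplicates are ignored.
        foreach (var term in terms)
            if (!_termToIndex.ContainsKey(term))
                _termToIndex[term] = _termToIndex.Count;
    }

    // Unknown tokens map to -1 (an assumption made for this sketch).
    public int[] Transform(IEnumerable<string> tokens) =>
        tokens.Select(t => _termToIndex.TryGetValue(t, out var index) ? index : -1).ToArray();
}

// If you have no dictionary yet, you need the estimator, which learns one from data.
public sealed class DictionaryTermEstimatorSketch
{
    public DictionaryTermTransformerSketch Fit(IEnumerable<string> tokens) =>
        new DictionaryTermTransformerSketch(tokens);
}

// Not trainable: the number of hash bits is all it needs.
public sealed class HashTermTransformerSketch
{
    private readonly int _hashBits;

    public HashTermTransformerSketch(int hashBits)
    {
        if (hashBits < 1 || hashBits > 31)
            throw new ArgumentOutOfRangeException(nameof(hashBits));
        _hashBits = hashBits;
    }

    public int[] Transform(IEnumerable<string> tokens) => tokens.Select(Bucket).ToArray();

    // 32-bit FNV-1a over UTF-16 code units, masked down to the requested bits.
    private int Bucket(string token)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in token)
            {
                hash ^= c;
                hash *= 16777619;
            }
            return (int)(hash & ((1u << _hashBits) - 1));
        }
    }
}
```

The constructor signatures carry the signal: if you cannot supply what the transformer's constructor asks for, you reach for the estimator instead.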

eerhardt · 2018-08-07T16:12:16Z

> SchemaShape is a 'relaxed schema', holding some properties of the schema, but not all

Do we need two completely separate Schema types for this? Would it be possible to use the same types, but not have all the information filled out for estimators?

It would be unfortunate if we had two parallel "schema" type graphs, and developers had to duplicate code to inspect/construct/etc the two different schema graphs we had.

Zruty0 · 2018-08-07T16:26:59Z

The current framework is separating the 'relaxed schema' into a separate collection. I don't really like it, and I would much rather have one, but it would be a lot of work to reconcile the two: mainly the existing schema handling code that somehow needs to define what it will do with a relaxed schema.

markusweimer · 2018-08-15T22:01:12Z

Regarding the two schema types: It seems to me that ISchema is a specialization of SchemaShape. We could represent this in a type hierarchy where ISchema extends SchemaShape. However, those names are probably all wrong. I'd look to @interesaaat for the proper databasy names for this.

eerhardt · 2018-11-30T17:03:53Z

@Zruty0 @TomFinley - how much work is left for this issue? Do you think this can be closed?

Zruty0 · 2018-12-18T22:55:32Z

Yep, I think we can close it.

Zruty0 added the API Issues pertaining the friendly API label Jul 25, 2018

Zruty0 added this to Proposed in API Proposals via automation Jul 25, 2018

Zruty0 moved this from Proposed to Approved in API Proposals Jul 25, 2018

Zruty0 mentioned this issue Jul 25, 2018

Proposal for Fluent API #474

Closed

Zruty0 moved this from Approved to Proposed in API Proposals Jul 25, 2018

Zruty0 moved this from Proposed to Experimenting in API Proposals Jul 25, 2018

This was referenced Jul 25, 2018

Proposal for Major Change in API #371

Closed

Full scope of API review #583

Closed

Zruty0 mentioned this issue Jul 30, 2018

Calculated Feature #595

Closed

This was referenced Jul 31, 2018

Added convenience constructors for ScoreTransform and TrainAndScoreTransform. #614

Merged

Direct API: Static Typing of Data Pipelines #632

Closed

Zruty0 mentioned this issue Aug 5, 2018

How to dump intermediate data in pipeline? #617

Closed

Zruty0 moved this from Experimenting to Finalized in API Proposals Aug 23, 2018

Zruty0 mentioned this issue Aug 27, 2018

New API for ML.NET #754

Closed

TomFinley mentioned this issue Sep 17, 2018

De-transformation of samplers, filters #933

Open

TomFinley mentioned this issue Sep 25, 2018

Microsoft.ML.TensorFlow is currently broken #1027

Closed

TomFinley mentioned this issue Oct 25, 2018

Trainer estimator cleanup for FastTrees and LightGBM #1352

Merged

TomFinley mentioned this issue Nov 26, 2018

Added Feature Contribution Calculation Transform #1677

Merged

eerhardt added this to To do in Project 13 via automation Nov 30, 2018

sfilipi mentioned this issue Dec 7, 2018

Suggestion related to Working with Columns #127

Closed

Zruty0 closed this as completed Dec 18, 2018

Project 13 automation moved this from To do to Done Dec 18, 2018

TomFinley mentioned this issue Jan 2, 2019

Internalize concepts of IDataTransform/Loader/TransformTemplate. #1995

Closed

TomFinley mentioned this issue Jan 14, 2019

Rename CreateTextReader to CreateTextLoader #2125

Merged

najeeb-kazmi mentioned this issue Jan 15, 2019

Decide a good name for TextLoader #2144

Closed

TomFinley mentioned this issue Feb 7, 2019

NimbusML's dot_export_pipeline for c# #2433

Closed

TomFinley mentioned this issue Feb 27, 2019

Text loader v.s in-memory data structure in API reference samples #2726

Closed

eerhardt mentioned this issue Mar 4, 2019

Chains of Chains #2820

Closed

TomFinley mentioned this issue Mar 11, 2019

Discussion: ColumnOptions actually a good name? #2884

Closed

TomFinley mentioned this issue Mar 26, 2019

Code documentation: Improve IEstimator/ITransformer/IDataView XML and high level docs #3096

Open

TomFinley mentioned this issue Apr 17, 2019

Relationship between SchemaShape from IEstimator and DataViewSchema from its ITransformer, and resulting fallout #3380

Closed

eerhardt mentioned this issue Jun 4, 2019

Forecasting model framework for time series. #1900

Merged

ghost locked as resolved and limited conversation to collaborators Mar 29, 2022
