Proposal for Fluent API #474

Open
Andy-Wilkinson opened this Issue Jul 2, 2018 · 2 comments

Andy-Wilkinson commented Jul 2, 2018

In this issue I describe a proposal for a fluent API for the building of ML.NET learning pipelines. This API would be consistent with existing .NET patterns such as LINQ, allowing people new to ML.NET to pick it up easily. It would allow clear concise code for simple scenarios, whilst allowing easy extension for more complex situations.

Background

The LearningPipeline API used by the current preview releases of ML.NET has a number of limitations. The programming model does not fit in with other .NET code (we do not write other code as a series of steps added to a list), and it follows a linear pipeline without merging or branching (e.g. with data from multiple sources, or train/test splitting of data).

The recent proposal for a major API change by @TomFinley in issue #371 is a big step forward towards a more natural programming model, with each step of the pipeline new-ed up in turn. I would argue, however, that this no longer reflects the true flow through a learning pipeline, with previous steps being relegated to a parameter of the constructor. This proposal builds on top of #371 with a fluent API.

Proposed API

By using extension methods (in a similar manner to LINQ) we can pass the previous step of a pipeline as the 'this' parameter into subsequent steps, preserving the natural flow. For example,

var loader = new TextLoader(new MultiFileSource(dataPath),
        useHeader: true, separator: ',',
        cols: new[] { ... });
var concat = loader.AddConcatTransform(env, "CategoryFeatures",
        "Bedrooms", "Bathrooms", "Floors", "Waterfront", "View", "Condition", "Grade",
        "YearBuilt", "YearRenovated", "Zipcode");
var categorical = concat.AddCategoricalTransform("CategoryFeatures");

This could be further cleaned up to,

var pipeline = new TextLoader(new MultiFileSource(dataPath),
        useHeader: true, separator: ',',
        cols: new[] { ... })
    .AddConcatTransform(env, "CategoryFeatures",
        "Bedrooms", "Bathrooms", "Floors", "Waterfront", "View", "Condition", "Grade",
        "YearBuilt", "YearRenovated", "Zipcode")
    .AddCategoricalTransform("CategoryFeatures");
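The chaining above can be illustrated with a minimal, self-contained sketch. It does not use ML.NET: `IDataPipeline`, `PipelineStep`, and the simplified signatures (no `env` argument) are hypothetical stand-ins that only record step names, chosen so the example compiles on its own. What it is meant to show is that each extension method takes the previous step as `this` and returns a new step.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a pipeline step that produces data; not an ML.NET type.
public interface IDataPipeline
{
    IReadOnlyList<string> Steps { get; }
}

public sealed class PipelineStep : IDataPipeline
{
    public IReadOnlyList<string> Steps { get; }

    public PipelineStep(IEnumerable<string> steps) => Steps = steps.ToList();
}

public static class FluentPipelineExtensions
{
    // The previous step arrives as 'this'; a new step is returned, so calls
    // can be chained in the same order the data flows.
    public static IDataPipeline AddConcatTransform(
        this IDataPipeline input, string outputColumn, params string[] sourceColumns)
    {
        var description = $"Concat({string.Join(",", sourceColumns)}) -> {outputColumn}";
        return new PipelineStep(input.Steps.Append(description));
    }

    public static IDataPipeline AddCategoricalTransform(this IDataPipeline input, string column)
    {
        return new PipelineStep(input.Steps.Append($"Categorical({column})"));
    }
}

public static class FluentDemo
{
    public static IDataPipeline BuildExample()
    {
        IDataPipeline loader = new PipelineStep(new[] { "Load" });
        return loader
            .AddConcatTransform("CategoryFeatures", "Bedrooms", "Bathrooms")
            .AddCategoricalTransform("CategoryFeatures");
    }

    public static void Main()
    {
        // Prints the recorded steps in pipeline order.
        foreach (var step in BuildExample().Steps)
            Console.WriteLine(step);
    }
}
```

Because every method returns a fresh `IDataPipeline` rather than mutating its input, intermediate steps can be kept in variables and reused, which is what makes branching possible later.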

More complex examples,

You could easily write extension methods that combine multiple steps, but could be consumed in the same way. Something like the following (I've created a hypothetical IDataPipeline to represent any pipeline step that produces data),

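// Extension methods must be static members of a static class; env is passed in explicitly.
public static IDataPipeline CreateCategories(this IDataPipeline input, IHostEnvironment env)
{
    return input.AddConcatTransform(env, "CategoryFeatures",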
            "Bedrooms", "Bathrooms", "Floors", "Waterfront", "View", "Condition", "Grade",
            "YearBuilt", "YearRenovated", "Zipcode")
        .AddCategoricalTransform("CategoryFeatures");
}

You could easily merge data from two pipelines,

var input1 = new TextLoader(...)
        .DoSomeTransforms();
var input2 = new TextLoader(...)
        .DoSomeMoreTransforms();

var input = input1.ConcatenateRows(input2);
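As a rough illustration of the merge step, here is a sketch that models a data source as a plain `IEnumerable<TRow>` instead of an ML.NET data view. `ConcatenateRows` is the hypothetical name from the proposal; implementing it with LINQ's `Concat` is an assumption made purely for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MergeExtensions
{
    // Hypothetical merge step: appends the rows of 'second' after the rows of 'first'.
    // Evaluation stays lazy because Enumerable.Concat is deferred.
    public static IEnumerable<TRow> ConcatenateRows<TRow>(
        this IEnumerable<TRow> first, IEnumerable<TRow> second)
    {
        if (first == null) throw new ArgumentNullException(nameof(first));
        if (second == null) throw new ArgumentNullException(nameof(second));
        return first.Concat(second);
    }
}

public static class MergeDemo
{
    public static void Main()
    {
        // Two independent "pipelines", each with its own transform.
        var input1 = new[] { "a,1", "b,2" }.Select(line => line.ToUpperInvariant());
        var input2 = new[] { " c,3 " }.Select(line => line.Trim());

        var input = input1.ConcatenateRows(input2);
        Console.WriteLine(string.Join(" | ", input));
    }
}
```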

You could take advantage of tuples to split the data pipeline, such that different steps could be applied before later merging,

var (train, test) = input.AddTrainTestSplit(...);

var trainPipeline = train.DoSomeTransforms();
var testPipeline = test.DoSomeMoreTransforms();
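The tuple-based split could look something like the sketch below, again using `IEnumerable<TRow>` as a stand-in for a data pipeline. The `AddTrainTestSplit` name comes from the proposal, but its parameters (`testFraction`, `seed`) and the per-row random assignment are assumptions for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitExtensions
{
    // Hypothetical split step: sends each row to the test set with probability
    // 'testFraction', otherwise to the training set. Returns both as a tuple.
    public static (IEnumerable<TRow> Train, IEnumerable<TRow> Test) AddTrainTestSplit<TRow>(
        this IEnumerable<TRow> input, double testFraction, int seed)
    {
        if (input == null) throw new ArgumentNullException(nameof(input));
        if (testFraction < 0.0 || testFraction > 1.0)
            throw new ArgumentOutOfRangeException(nameof(testFraction));

        var random = new Random(seed);
        var train = new List<TRow>();
        var test = new List<TRow>();
        foreach (var row in input)
        {
            if (random.NextDouble() < testFraction)
                test.Add(row);
            else
                train.Add(row);
        }
        return (train, test);
    }
}

public static class SplitDemo
{
    public static void Main()
    {
        var input = Enumerable.Range(0, 100);

        // Tuple deconstruction gives each branch its own name.
        var (train, test) = input.AddTrainTestSplit(testFraction: 0.2, seed: 42);

        // Each branch can continue with its own chain of steps.
        var trainScaled = train.Select(x => x * 2);
        var testFiltered = test.Where(x => x % 2 == 0);

        Console.WriteLine($"train rows: {trainScaled.Count()}, test rows kept: {testFiltered.Count()}");
    }
}
```

Note that the results of the per-branch chains are assigned to new variables: with steps that return new pipelines rather than mutating their input, a bare `train.DoSomeTransforms();` would discard its result.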

Summary

This is an outline proposal for an alternative API that could be used alongside, or instead of, that proposed in issue #371. There are still some rough edges here and there, but I hope that this will start a discussion of the possibilities provided by a fluent API.

eerhardt added this to Proposed in API Proposals Jul 19, 2018

Member

Zruty0 commented Jul 25, 2018

@Andy-Wilkinson, I like this idea. Adding a number of extension methods that allow for common operations is attractive.

I think that, because they represent a 'convenience layer' on top of the API, they probably should be put into a separate namespace, so that the user can 'opt in' to use them.

Not to mention that a user could create their own custom layer of extension methods on top of pipelines, tailored to the typical ML tasks that they perform.

Note that #371 actually puts a barrier against such usage (or at least calls for an extensive framework on top of direct instantiation, something we wanted to avoid), but the 'pipelines' as they currently are suggested in #581 will be compatible with these extensions.
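The 'opt in' idea can be sketched as follows. All namespace and type names here (`CoreApi`, `CoreApi.Fluent`, `Pipeline`, `AddStep`) are hypothetical and exist only for this illustration; the point is that the extension methods become visible only in code that imports the convenience namespace.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

namespace CoreApi
{
    // Hypothetical core type: usable without any fluent helpers.
    public sealed class Pipeline
    {
        public IReadOnlyList<string> Steps { get; }

        public Pipeline(params string[] steps) => Steps = steps;
    }
}

namespace CoreApi.Fluent
{
    // The convenience layer lives in its own namespace.
    public static class PipelineExtensions
    {
        public static Pipeline AddStep(this Pipeline pipeline, string name)
            => new Pipeline(pipeline.Steps.Append(name).ToArray());
    }
}

namespace Consumer
{
    using CoreApi;
    using CoreApi.Fluent; // Opt in: without this directive, AddStep does not resolve.

    public static class OptInDemo
    {
        public static Pipeline Build() => new Pipeline("Load").AddStep("Concat");

        public static void Main()
        {
            Console.WriteLine(string.Join(" -> ", Build().Steps));
        }
    }
}
```

A user-defined layer of extension methods would follow the same shape: a static class in the user's own namespace, imported wherever it is wanted.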

Andy-Wilkinson commented Jul 25, 2018

Thanks for the comments @Zruty0. I did read #581 with interest, as it fits well with the fluent-style API I discussed here. I would disagree, however, that a fluent API would be an extensive framework on top of #371: there would simply be some lightweight wrappers around the constructors. In fact, some of the work described in #371 regarding additional convenience constructors could equally be achieved via extension methods.

For example, the AddConcatTransform extension would build on top of the existing ConcatTransform constructor with,

public static IDataTransform AddConcatTransform(this IDataView input, IHostEnvironment env, string columnName, params string[] source)
{
    return new ConcatTransform(env,
        new ConcatTransform.Arguments() {
            Column = new[] {
                new ConcatTransform.Column() { Name = columnName, Source = source }
            }
        }, input);
}

Whether to move this 'convenience layer' into a separate namespace probably depends on what the target audience is for the framework. My understanding was that this was aimed at existing .NET developers who want a straightforward on-ramp for ML tasks, hence I would argue that a convenience API should be the default (and the direction guided to by documentation, samples, etc.). Only when performing more complex tasks should developers need to delve deeper.
