Draft version of ML.NET CLI specs with AutoML capabilities #2693

Merged
merged 8 commits into from
May 6, 2019

@CESARDELATORRE (Contributor) commented Feb 22, 2019

This is a draft version of ML.NET CLI specs to be discussed in the open with the ML.NET community.
Its initial functionality will be based on .NET AutoML (which will also be part of ML.NET).

For further details, read the MLNET-CLI-Specs.md document in the PR.

Related issues:
#2694
#1203

@CESARDELATORRE added the enhancement, documentation, and command-line labels on Feb 22, 2019
codecov bot commented Feb 22, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@412e1f9).
The diff coverage is n/a.

@@            Coverage Diff            @@
##             master    #2693   +/-   ##
=========================================
  Coverage          ?    71.7%           
=========================================
  Files             ?      809           
  Lines             ?   142489           
  Branches          ?    16116           
=========================================
  Hits              ?   102174           
  Misses            ?    35885           
  Partials          ?     4430
Flag Coverage Δ
#Debug 71.7% <ø> (?)
#production 67.93% <ø> (?)
#test 85.9% <ø> (?)

@eerhardt (Member) left a comment

This looks like it is going to be an awesome feature that will help many developers use machine learning!

docs/specs/mlnet-cli/MLNET-CLI-Specs.md (outdated)

![alt text](Images/MLNET-AutoML-Positioning.png "ML.NET and AutoML")

As mentioned at the beginning of the spec doc, the CLI will be branded as the ML.NET CLI since this CLI will also have additional features where AutoML is not needed.
Member

I'm not sure I follow what this sentence is trying to convey.

the CLI will be branded as the ML.NET CLI since this CLI will also have additional features where AutoML is not needed.

"AutoML" is just a component of ML.NET.... This would be like saying "The .NET CLI will be branded as the .NET CLI since this CLI will also have additional features where Roslyn is not needed".

- Future versions will be able to run AutoML compute and other compute processes (such as a regular model training) in Azure.
- The ML.NET CLI will consume the AutoML API (Microsoft.ML.Auto NuGet package) which will only consume public surface of ML.NET.
- The CLI proposed here will not provide for “continue sweeping” after sweeping has ended.
- When running locally with the by default behaviour (no Azure), the CLI will not make any webservice calls and will not require any authentication.
Member

(nit) The wording here is a bit hard to understand: "When running locally with the by default behaviour (no Azure),". Maybe change it to "When training locally, the CLI will not make any ...".

- Future versions will be able to run AutoML compute and other compute processes (such as a regular model training) in Azure.
- The ML.NET CLI will consume the AutoML API (Microsoft.ML.Auto NuGet package) which will only consume public surface of ML.NET.
- The CLI proposed here will not provide for “continue sweeping” after sweeping has ended.
- When running locally with the by default behaviour (no Azure), the CLI will not make any webservice calls and will not require any authentication.
Member

the CLI will not make any webservice calls

This raises the question of telemetry. I assume this means the tool will not capture any telemetry the way the .NET CLI does?

Contributor

We should seriously consider sending opt-in telemetry from the CLI.


- Add additional commands to do *"machine learning without code"*:
- *train*: It will only generate the best model .ZIP file. For example:
- `mlnet train --ml-task Regression --dataset "/MyDataSets/Sales.csv"`
Member

Is "ml" in --ml-task redundant? mlnet train --ml-task

The `mlnet new` command provides a CLI oriented way to create projects or solutions such as:

- Create a single project (console app) with:
- *Training ML.NET code:* One seggregated method per ranked model, but part of the same console app.
Member

(nit) spelling seggregated


This argument provides the filepath to either one of the following:

- *A: The whole dataset file:* If using this option and the user is not providing `--test-dataset` and `--validation-dataset`, then cross-validation (k-fold, etc.) or automated data split approaches will be used internally for validating the model. I nthat case, the user will just need to provide the dataset filepath.
Member

(nit) I nthat case

- `true`
- `false`

The default value is `true`.
Member

Does this mean that if I specify --has-header on the command line, it will default to true? Or does it mean that if I DON'T specify --has-header it will default to true?

Member

I think in either case it defaults to true.

Member

If so, then the parameter name should be flipped the other way: --no-header. It doesn't make sense for these two commands:

foo.exe

and

foo.exe --my-option

to do the same thing.

@srsaggam (Member), Feb 22, 2019

Should this rule be applied to all Boolean arguments with default values in general?

Member

Typically, yes.

Look at any other command line. Let's say git.

git-fetch - Download objects and refs from another repository

SYNOPSIS
git fetch [<options>] [<repository> [<refspec>…​]]
git fetch [<options>] <group>
git fetch --multiple [<options>] [(<repository> | <group>)…​]
git fetch --all [<options>]

By default, git fetch won't fetch from all remotes. But when you specify git fetch --all, it does.
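
The flag convention argued for in this thread can be sketched as follows. This is a minimal illustration using Python's standard `argparse`, not the ML.NET CLI implementation: the `--no-header` name is the one proposed above, while `build_parser` and the `mlnet-sketch` program name are hypothetical.

```python
import argparse


def build_parser():
    # Hypothetical parser: it only illustrates the flag convention, not the real mlnet CLI.
    parser = argparse.ArgumentParser(prog="mlnet-sketch")
    # The default behaviour (dataset has a header row) needs no flag at all.
    # Passing the flag switches to the non-default behaviour.
    parser.add_argument(
        "--no-header",
        dest="has_header",
        action="store_false",  # store_false implies a default of True
        help="treat the first row of the dataset as data, not column names",
    )
    return parser


without_flag = build_parser().parse_args([])
with_flag = build_parser().parse_args(["--no-header"])
print(without_flag.has_header, with_flag.has_header)
```

With this shape, running the tool with and without the flag never means the same thing, which is the point made with the `git fetch --all` example.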



- Will overridden default values just be provided as static values in the CLI commands, or also be based on a related configuration .JSON file placed along with the CLI executable? See [comparable .JSON files for dotnet CLI templates](https://github.com/dotnet/dotnet-template-samples/blob/master/05-multi-project/.template.config/template.json).

- Support Custom templates for "automlnet new" such as [Custom templates for dotnet new](https://docs.microsoft.com/en-us/dotnet/core/tools/custom-templates)? - That could allow extensibility for other application project types or even for other languages like F# or additional scenarios.
Member

what is "automlnet new"?


- Stopping criteria - { Default timeout or timeout provided by the user }

Gleb's (Cesar: Although, ins't this related to AutoML API instead the CLI?):
Member

I don't understand this and the below "priority order" list.

Is this just an author's "TODO Notes" list?

(nit) ins't.

- [Uber Ludwig CLI Blog Post](https://eng.uber.com/introducing-ludwig/)
- [Uber Ludwig CLI Getting Started](https://uber.github.io/ludwig/getting_started/)
- [Uber Ludwig CLI syntax](https://uber.github.io/ludwig/user_guide/)

Contributor

Adding this as a potential reference:

TransmogrifAI (pronounced trăns-mŏgˈrə-fī) is an AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark with minimal hand tuning

TransmogrifAI Site
TransmogrifAI Repo


## Context

**Commitments: This specs document is 100% aspirational and will change while it's being discussed and implementation is evolving based on feedback. There are no commitments derived from this document except for the first upcoming minor version at any given time (v0.1, initially).**
@glebuk (Contributor), Feb 22, 2019

v0.1,

If the CLI is bound to a specific version of ML.NET, consider synchronizing the versions in some way, so it is clear that a given version of the CLI works with a given version of ML.NET. For example, consider calling the CLI for ML.NET v0.10 CLI v0.10. Otherwise it would be a version zoo. Then you can simply refer to the CLI for release v0.11 and so forth, and align releases with ML.NET as well.

Member

Coupling like that can't happen unless we want to always tie the versions together. For example, when we ship ML.NET 1.0, will the CLI be 1.0? I assume not... But I agree it does become a version zoo (we on the .NET team know...), but it is necessary until you can sync up the schedules.

See https://github.com/dotnet/designs/blob/master/accepted/sdk-version-scheme.md for how .NET Core tackles this problem.


**Commitments: This specs document is 100% aspirational and will change while it's being discussed and implementation is evolving based on feedback. There are no commitments derived from this document except for the first upcoming minor version at any given time (v0.1, initially).**

The CLI will be branded as the ML.NET CLI since this CLI will also have additional features in addition to AutoML features.
@glebuk (Contributor), Feb 22, 2019

since this CLI will also have additional features in addition to AutoML features.

Remove. This fragment makes no sense in this context, as AutoML has not yet been introduced or mentioned above. Basically, you want to rephrase this paragraph to say that this is an ML.NET CLI that would also use some AutoML features that will be included in the future.

The .NET AutoML API (.NET based) will be part of the [ML.NET](https://github.com/dotnet/machinelearning) API.
AutoML features will be used for certain important foundational features of the ML.NET CLI.

This specs-doc focuses most of all on the CLI features related to AutoML, but it will also consider (in less detail) the scenarios where AutoML is not needed, so the CLI syntax will be consistent end-to-end for all the possible scenarios in the future.
@glebuk (Contributor), Feb 22, 2019

specs-doc

What is that? Perhaps we should call it a spec instead?


# Problem to solve

Customers (.NET developers) have tolds us through many channels that they can get started with [ML.NET](https://github.com/dotnet/machinelearning) and follow the initial simple examples. However, as soon as they have to create their own model to solve their problems, they are blocked because they don't know what learner/algorithms are better for them to pick and use, what hyper-parameters to use or even what data transformations they need to do.
Contributor

tolds

told

@artidoro (Contributor)

@jwood803 who had a PR open #1620 on this subject.


We need a way to enable regular .NET developers to easily use [ML.NET](https://github.com/dotnet/machinelearning) to create custom models solving typical ML scenarios in the enterprise.

If we don't provide a really simple way to use [ML.NET](https://github.com/dotnet/machinelearning) for regular developers (almost no data science knowledge at all), then we won't be able to really "democratize" machine learning for .NET developers.
Contributor

If we don't provide a really

Rephrase this as a positive statement: "In order to democratize X, we need to...".


- Regular .NET developers getting started with machine learning while trying to use .NET (C# and F# most of all) for ML.

- Specific developer roles are: enterprise developers, start-up developers, ISV developers and internal MSFT teams developers.
Contributor

MSFT

Internal slang; change to Microsoft.

- Specific developer roles are: enterprise developers, start-up developers, ISV developers and internal MSFT teams developers.


# Goals
Contributor

Goals

These should also be goals:

  • Ease deployment and productization of models via service templates
  • Teach best practices via generated code templates.


**Foundational features:**

- Provide an end-to-end **ML.NET CLI** for developers (i.e. *"mlnet new"*) to generate either the final trained model and the pipeline's C#/ML.NET implementation code in a similar fashion to the [.NET Core CLI](https://docs.microsoft.com/en-us/dotnet/core/tools/?tabs=netcore2x). The CLI is also a foundation upon which higher-level tools, such as Integrated Development Environments (IDEs) can rest.
@glebuk (Contributor), Feb 23, 2019

mlnet new

A function should do one thing and do it well, thus:
mlnet new -> generates code
mlnet fit -> trains a model.
mlnet transform -> scores/inferences a model.
Having simple verbs do simple things would simplify the docs, simplify the API, clarify meaning, and allow each one to be more powerful.

Contributor Author

This is something to be discussed and validated/invalidated by users. My point of view is that "mlnet new" should generate the scoring application code with everything it needs to run end-to-end, which means already having the .ZIP file. And since that was already created by AutoML, it can be provided without the user needing to do an additional step (train the model). Training code could be optional for users who want further learning or custom modifications of trainers and hyper-parameters, which is not common for regular .NET developers.

I think we need to think about what .NET developers new to ML.NET would want for their usual workflow, rather than structuring everything into very granular operations ("Function should do one thing and do it well"). The SRP (Single Responsibility Principle) applies to classes and methods, not necessarily to a CLI, which should accommodate the user's workflow most of all.

- The ML.NET CLI automation will be able to run locally, on any development environment PC (Windows, Mac or Linux).
- Future versions will be able to run AutoML compute and other compute processes (such as a regular model training) in Azure.
- The ML.NET CLI will consume the AutoML API (Microsoft.ML.Auto NuGet package) which will only consume public surface of ML.NET.
- The CLI proposed here will not provide for “continue sweeping” after sweeping has ended.
Contributor

The CLI proposed here will not provide for “continue sweeping” after sweeping has ended.

This should go under future features. It is valuable functionality, but not needed for V1.

- The CLI proposed here will not provide for “continue sweeping” after sweeping has ended.
- When running locally with the by default behaviour (no Azure), the CLI will not make any webservice calls and will not require any authentication.
- The CLI will provide feedback output (such as % work done or high level details on what's happening under the covers) while working on the long-running tasks.
- The ML.NET CLI will be aligned and integrated to the [.NET Core CLI](https://docs.microsoft.com/en-us/dotnet/core/tools/?tabs=netcore2x). A good approach is to implement the ML.NET CLI as a [.NET Core Global Tool](https://docs.microsoft.com/en-us/dotnet/core/tools/global-tools) (i.e. named "mlnet" package) on top of the "dotnet CLI".
@glebuk (Contributor), Feb 23, 2019

integrated

This paragraph basically repeats the one under Foundational features.


### CLI default behaviour and overridability

The CLI will have default behavior for each of these features; however, the CLI's default settings can be overridden by providing new values in the console command (and optionally the advanced configuration .YAML file and response file .rsp placed along with the CLI executable).
@glebuk (Contributor), Feb 23, 2019

.YAML file and response file .rsp

YAML or RSP, but not both. Having CLI, YAML and RSP is too much. How about CLI + YAML? RSP can be easily reproduced with a CMD file.

You can use it with:

```console
mlnet
@glebuk (Contributor), Feb 23, 2019

add help command: [-h|--help]

mlnet
```

## Command 'new'
@glebuk (Contributor), Feb 23, 2019

Can we have two commands:

new -- only generates the project template. Perhaps generates featurization and perhaps learning projects and code based on heuristics, similar to the other tool's GUI wizard?
-- The model is obtained by compiling and running the generated project.
auto -- does all the AutoML stuff?


(*Release 0.2 examples*)

Simplest command where the tool infers the type of ML taks to perform based on the data:
Contributor

taks

task


Create and train a model based on parameters specified in the .rsp file plus more advanced model settings in the .yaml file:

` mlnet new @my_cli_config_args.rsp --model-settings-file "./my_model_settings.yaml" `
Contributor

Can we have a single format for everything?
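
For context on what the response-file half of that command involves, here is a minimal sketch of `@file` argument expansion using Python's standard `argparse`. It is an illustration under stated assumptions, not the ML.NET CLI's behaviour: `ResponseFileParser`, `build_parser`, `demo`, and the `int` type for `--max-exploration-time` are hypothetical, and only the flag names and the `.rsp` file name come from the spec.

```python
import argparse
import os
import shlex
import tempfile


class ResponseFileParser(argparse.ArgumentParser):
    def convert_arg_line_to_args(self, arg_line):
        # Allow several whitespace-separated arguments per line, with quoting
        # and '#' comments, instead of argparse's default one-argument-per-line.
        return shlex.split(arg_line, comments=True)


def build_parser():
    # fromfile_prefix_chars makes "@path" expand to the arguments stored in that file.
    parser = ResponseFileParser(prog="mlnet-sketch", fromfile_prefix_chars="@")
    parser.add_argument("--ml-task")
    parser.add_argument("--dataset")
    parser.add_argument("--max-exploration-time", type=int)  # type is an assumption
    return parser


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        rsp_path = os.path.join(tmp, "my_cli_config_args.rsp")
        with open(rsp_path, "w", encoding="utf-8") as rsp:
            rsp.write('--ml-task Regression\n--dataset "/MyDataSets/Sales.csv"\n')
        # Arguments from the file and from the command line are merged.
        return build_parser().parse_args(["@" + rsp_path, "--max-exploration-time", "60"])


print(demo())
```

A response file in this sense is only a stored argument list, so a single richer format (such as the YAML file) could in principle cover the same ground.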


--------------- (v0.1) -------------------

--ml-task <value>
@glebuk (Contributor), Feb 23, 2019

value

For each argument:

  • specify the default value
  • add a short version, such as [--ml-task | -t]
  • add the list of supported argument values
  • specify whether multiple values can be given
  • imagine you have to type each command, and it's a pain to have a multi-line CLI. Consider making them as small as possible while maintaining readability.
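
The checklist above can be sketched with Python's standard `argparse`, purely as an illustration of what a fully specified argument looks like. Everything here is an assumption except the `--ml-task` / `-t` and `--max-exploration-time` names: the task list (only "Regression" appears in the spec examples), the `-x` short form, the default of 10, and the `--ignore-columns` argument are hypothetical.

```python
import argparse

# Assumed task names for illustration; only "Regression" appears in the spec examples.
ML_TASKS = ["regression", "binary-classification", "multiclass-classification"]


def build_parser():
    parser = argparse.ArgumentParser(prog="mlnet-sketch")
    parser.add_argument(
        "-t", "--ml-task",            # short and long form
        required=True,
        type=str.lower,               # accept "Regression" as well as "regression"
        choices=ML_TASKS,             # supported values, shown in --help and in errors
        help="machine learning task to solve",
    )
    parser.add_argument(
        "-x", "--max-exploration-time",
        type=int,
        default=10,                   # hypothetical default, documented in the help text
        help="exploration budget (default: %(default)s)",
    )
    parser.add_argument(
        "--ignore-columns",
        nargs="*",                    # zero or more values may be given
        default=[],
        help="columns to leave out of training; several may be listed",
    )
    return parser


args = build_parser().parse_args(["-t", "Regression", "--ignore-columns", "id", "notes"])
print(args.ml_task, args.max_exploration_time, args.ignore_columns)
```

Because the supported values are declared once in `choices`, the help output lists them without needing a separate listing command.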

--test-dataset <value>
]

--label-column-name <value>
Contributor

Reduce name and index to a single parameter:
[--label-column | -label] Usually index/name can be understood from context.
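
One way to read "understood from context" is sketched below in Python. It assumes a single string value where an all-digit value means a zero-based index and anything else means a column name; `resolve_label_column` is a hypothetical helper, not part of the spec.

```python
def resolve_label_column(value, header=None):
    """Return the zero-based index of the label column.

    `value` is the raw command-line string. `header` is the list of column
    names, or None when the dataset has no header row.
    """
    if value.isdecimal():
        # All digits: interpret as an index (assumed zero-based).
        index = int(value)
        if header is not None and index >= len(header):
            raise ValueError(f"label index {index} is outside 0..{len(header) - 1}")
        return index
    if header is None:
        raise ValueError("a label column name requires a dataset with a header row")
    if value not in header:
        raise ValueError(f"no column named {value!r} in the header")
    return header.index(value)


header = ["date", "units", "price"]
print(resolve_label_column("price", header), resolve_label_column("2", header))
```

The trade-off is that a column whose name consists only of digits can no longer be addressed by name, which is one argument for keeping two separate parameters.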


--label-column-name <value>
|
--label-column-index <value>
@glebuk (Contributor), Feb 23, 2019

We must add "feature", groupid, weight, ignore, and rowid columns, at least eventually.
We need to support index syntax for columns, such as 0-4, 5, 10-*.
Having a feature-columns argument is a P0 feature. Without it, it would be impossible to use from the command line for most datasets.
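
The index syntax mentioned above could be parsed along these lines. This Python sketch assumes zero-based, inclusive ranges and reads `*` as "through the last column"; `parse_column_spec` is a hypothetical helper, and the exact grammar is not defined in the spec.

```python
def parse_column_spec(spec, column_count):
    """Expand a spec such as "0-4,5,10-*" into a sorted list of column indices.

    Ranges are inclusive; "*" as the end of a range means the last column.
    """
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text)
            end = column_count - 1 if end_text.strip() == "*" else int(end_text)
        else:
            start = end = int(part)
        if start > end or end >= column_count:
            raise ValueError(f"column range {part!r} is outside 0..{column_count - 1}")
        indices.update(range(start, end + 1))
    return sorted(indices)


print(parse_column_spec("0-4,5,10-*", column_count=13))
```

The open-ended `10-*` form is why the parser needs the column count, so the dataset has to be inspected before the arguments can be fully validated.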


[--has-header <value>]

[--max-exploration-time <value>]
Contributor

[--timeout | t]

[--verbosity <value>]

[--name <value>]
[--list-ml-tasks]
Contributor

This should be part of help.


--ml-task <value>

--dataset <value>
@glebuk (Contributor), Feb 23, 2019

--dataset

Let's be consistent and rename dataset:
[--train-dataset | -data | -d ] -- data used for training or cross-validation.
Consider changing the names for better sortability and readability:
--data-train, --data-test, --data-validation -- that way they will sort nicely and be easy to find.

Contributor Author

We had that approach originally (--train-dataset | --dataset), but based on tests with the CLI it was getting pretty confusing. It is a lot simpler to reuse a single --dataset argument, either for a single data file or for the training dataset in a split approach.
In any case, we'll ask users about this and see what they prefer. I agree that this is a discussion point.

|
--label-column-index <value>

[--has-header <value>]
Contributor

header

Need another argument: delimiter.


- Create a single project (console app) with:
- *Training ML.NET code:* One seggregated method per ranked model, but part of the same console app.
- *Scoring/consuming ML.NET code:* One seggregated method per ranked model, but part of the same console app.
@glebuk (Contributor), Feb 23, 2019

One seggregated method per ranked model

Can we get away with a single method for the various model versions? The input/output data signatures of all models will be the same. The only thing that differs is which zip file to load.


### Arguments

Invalid input of arguments should cause it to emit a list of valid inputs and an error message explaining which arg is missing, if that is the case.
@glebuk (Contributor), Feb 23, 2019

Other behaviour questions:
- What happens when we cannot infer the label?
- How do we tell the user what we decided the label and features to be?
- Should we make certain sections of our CLI interactive? For example, if we infer the label and feature columns, it would be nice for the user to review our choice, optionally edit it, and then accept.

- *Training project:* Console project with model-training ML.NET code
- *Common-code project:* Class library project with common code (Data/Observation class, Prediction class, etc.)
- *End-user-app project:* End-user application type (depending on template) with ML.NET code scoring the model/s.
- *Trained models:* Multiple ranked trained models in the form of several .ZIP files.
@glebuk (Contributor), Feb 23, 2019

Trained models: Multiple ranked trained models in the form of several .ZIP files

That does not make sense to me. If the project has the "train", why do we need to store models? Every time you run it, you will generate a new zip. The end user should not check in models with training code.


- Generate the number of "best models" (project folders and file models) specified by `--best-models-count`

- Simple HTML report with minimum models' metrics.
Contributor

Random thought, does this have to just be an HTML report? Can it also have an option to output the metrics to the console?

Contributor Author

Metrics will be on the console. The HTML report would be an extra for when an additional level of analysis is needed, for instance when comparing multiple models in a chart.
The initial versions of the CLI will only show them on the console, though.
So, you're right. 👍

@bartczernicki

I assume we are going to have the public NuGet be called "Automated ML"? (in line with what is in the AML service) AutoML is a Google product that is different from this functionality.

@eerhardt (Member) commented May 3, 2019

@CESARDELATORRE - what's the status of this PR? Can it be either merged or closed?

@CESARDELATORRE (Contributor Author)

Can you merge it? I’d like to have it as a reference for the upcoming evolution.
We didn’t implement the whole scope planned in there for the first public preview we're releasing.

Keep it as "Draft" in the title, please.
Thanks,

@eerhardt eerhardt merged commit 7b7a2bc into dotnet:master May 6, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
Labels: command-line, documentation, enhancement

9 participants