
Re-organizing the MLJ stack #317

Closed · ablaom opened this issue Nov 4, 2019 · 9 comments

Labels: design discussion (Discussing design issues), MLJ-org

Comments

@ablaom (Member) commented Nov 4, 2019

With auto-merge in place, the pains of rolling out breaking changes in the stack are greatly mitigated and I think the time is ripe for increasing modularisation. If nothing else, we could all benefit from a greater distribution of testing, as this is becoming increasingly painful.

Here are some suggestions to get the discussion rolling.

New repos

  • MLJManual: for the MLJ manual, modelled on the present MLJTutorials. Generation of documentation is a serious slowdown for testing in the present MLJ repo (Move the MLJ manual to new repo MLJManual #316)

  • MLJResampling: for the general resampling algorithm (evaluate!) and in-house resampling strategies (Holdout, CV, etc.); a usage sketch follows this list.

  • MLJTuning: for the general tuning algorithm and in-house tuning strategies (Improve the tuning strategy interface #315).

  • MLJComposition: for the machine and composite model functionality (the two are currently integrated).
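
To make concrete what would move to MLJResampling, here is a minimal usage sketch of the existing evaluate! pattern, assuming a machine `mach` already bound to a model and data, and the rms measure as currently exported:

```julia
using MLJ

# estimate performance under 6-fold cross-validation ...
evaluate!(mach, resampling=CV(nfolds=6), measure=rms)

# ... and under a simple holdout split
evaluate!(mach, resampling=Holdout(fraction_train=0.7), measure=rms)
```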

I suggest that the raw interfaces for tuning strategies and resampling strategies live in MLJBase. Then out-of-house implementations of these strategies need only import MLJBase (importing MLJTuning for testing purposes only).
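
For illustration only, an out-of-house tuning strategy might then look something like the sketch below. TuningStrategy and candidates are placeholders for whatever the raw interface ends up being; neither is existing MLJBase API:

```julia
# Purely hypothetical sketch: `TuningStrategy` stands in for an abstract type
# that MLJBase would own under this proposal.
abstract type TuningStrategy end        # imagine: MLJBase.TuningStrategy

struct RandomSearch <: TuningStrategy   # an out-of-house strategy
    n_candidates::Int
end

# the (placeholder) strategy contract: propose hyperparameter values to try
candidates(strategy::RandomSearch, range) =
    [rand(range) for _ in 1:strategy.n_candidates]

candidates(RandomSearch(3), 1:10)       # e.g. [7, 2, 9]
```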

Dependencies

[Image dependency_tree: proposed dependency graph for the re-organized stack]

A question remains how best to handle a composite model that one wants to add to the registry. If it is defined in an external package, then that package can have MLJComposition as a dependency. If, however, it is defined somewhere in MLJModels, then I guess MLJModels would have to add MLJComposition to its hard dependencies, which would be unfortunate.

Thoughts on any of this, anyone?

cc: @DilumAluthge

ablaom added the design discussion and MLJ-org labels on Nov 4, 2019
@ablaom (Member, Author) commented Nov 4, 2019

Okay, I've had a closer look and can see that separating machines from composition is pretty easy. And the machine functionality is a very small amount of code that could live in MLJBase.jl, simplifying the dependencies further.

[Image dep_tree: simplified dependency graph, with machine functionality in MLJBase]
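
For readers less familiar with the codebase, the machine functionality in question is the basic bind-fit-predict layer. A minimal sketch (model and data choices are illustrative; in some MLJ versions @load returns a model type rather than an instance):

```julia
using MLJ

X, y = @load_boston                  # demo dataset shipped with MLJ
train, test = partition(eachindex(y), 0.7)

tree = @load DecisionTreeRegressor   # requires DecisionTree.jl
mach = machine(tree, X, y)           # a machine binds a model to data
fit!(mach, rows=train)               # train on the training rows only
yhat = predict(mach, rows=test)      # predict on the held-out rows
```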

@juliohm (Contributor) commented Nov 6, 2019

It is wonderful to see this modularisation of the stack ❤️ Thank you for the brainstorm, and for the initiative. I would like to take this opportunity to share my own experience with a similar modularisation in GeoStats.jl.

In my case, I decided to put everything in GeoStatsBase.jl and only lift functionality to separate packages when there was a clear concept that could be used independently from the stack. For example, the Variography.jl package contains tools that can be used standalone without operating directly on the GeoStatsBase.jl types. Similarly, the KrigingEstimators.jl package contains estimators that operate on raw Julia arrays, and so deserves a separate package. All the other functionality, which assumes types defined in GeoStatsBase.jl, lives there.

The downside of modularising a project that way is that GeoStatsBase.jl can get very big. Happily, we have Requires.jl nowadays to solve the big-dependency issue for packages that just want to define new learning models. The main benefit of this approach is really on the dev side: the release process is smoother, and everyone is touching the same codebase in the same Git repo.
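
For anyone unfamiliar with that pattern, Requires.jl conditionally loads glue code only when the optional dependency is present. A minimal sketch (the UUID is JSON.jl's, borrowed from the Requires README; the glue file name is illustrative):

```julia
using Requires

function __init__()
    # runs at package load time; includes the glue code only if JSON.jl is loaded
    @require JSON="682c06a0-de6a-54ab-a142-c8b1cf79cde6" include("json_glue.jl")
end
```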

The diagram brainstormed above makes a lot of sense to me. I would only point out that some of these modules do not work standalone. For example, MLJTuning.jl assumes learning models, so maybe it could be absorbed into MLJBase.jl to avoid a hard split in the codebase. On the other hand, something like LossFunctions.jl could live in a separate package because loss functions are simple Julia functions (perhaps depending only on ScientificTypes.jl). The decision between creating a separate package and adding a submodule to MLJBase.jl could be based on criteria like these.

Regardless of the final modularisation, we will be able to collaborate much more now that the implementations and names are self-contained in smaller packages as opposed to the umbrella MLJ.jl. My personal view is that MLJ.jl is a curated combination of machine learning packages. Similarly, GeoStats.jl is a curated combination of packages for spatial problems. And many other packages could be developed in the future in a modular way, reusing parts of the learning stack (e.g. TimeSeries.jl, CoDa.jl).

@JockLawrie commented

As another data point, I've been training MixtureModels in a custom way for years and I'm having trouble training them using the MLJ interface.

However, I can get the component distributions to satisfy the MLJ interface. Then I use a custom sample-fit-combine approach to construct the ensemble. See this SampleFitCombine.jl package (unregistered) for details and examples.

I'd love to be able to do this using MLJ, which would remove the need for SampleFitCombine and avoid cluttering the ecosystem. Can you see a way to do this?

@ablaom (Member, Author) commented Nov 25, 2019

Thanks for your query.

I'm guessing that MLJ's EnsembleModel wrapper provides at least some of the functionality of SampleFitCombine (the bagging). Is this what you're after?

Happy to discuss extensions. Contributions welcome.

See https://alan-turing-institute.github.io/MLJ.jl/dev/homogeneous_ensembles/ (manual) and https://alan-turing-institute.github.io/MLJTutorials/pub/getting-started/ensembles-2.html (tutorial)
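
For orientation, a minimal bagging sketch with EnsembleModel, assuming an atomic regression model tree and data X, y are already in scope (keyword names here follow the linked manual and may differ in later MLJ versions):

```julia
using MLJ

# 100 copies of `tree`, each trained on an 80% bootstrap resample
forest = EnsembleModel(atom=tree, n=100, bagging_fraction=0.8)
mach = machine(forest, X, y)
evaluate!(mach, resampling=CV(nfolds=6), measure=rms)
```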


@JockLawrie commented

Yes, the homogeneous ensembles provide some of the functionality. I think this could be more general, along the lines of SampleFitCombine, which also allows other ways to sample training data and combine learners.

For example, see the test case, which partitions the training data by clustering the points according to cdf(predict(ensemble, training_data), training_data_point). This is the Sample step. A learner is then fitted to each cluster of data points in the partition (Fit), and the learners are combined with weights that minimize the loss (Combine).

I'm happy to try to contribute, but at this point I think it better to first flesh out what this would look like. SampleFitCombine is my proposal, namely (a rough sketch of the core loop appears at the end of this comment):

  1. The Ensemble struct
  2. Sample methods constructed independently, using wsample! in StatsBase for example.
  3. Fit methods are already provided by MLJ.
  4. Three combine methods (uniform, pre-specified, optimal)
  5. predict(ensemble, Xtest)
  6. loss(ensemble, X, y, lossfunction)
  7. Conveniences

The Ensemble struct in SampleFitCombine serves the same purpose as MLJ's EnsembleModel. However, the latter seems to assume that bagging is the primary (or only) mechanism for constructing ensembles. But training data may be sampled in many ways, not just via bagging, and learners may be fit independently or in sequence. E.g., boosting can be expressed as a sequential sample-fit-combine method.
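
To make this concrete, here is a rough, purely illustrative sketch of the core loop; none of the names below are taken from SampleFitCombine.jl itself:

```julia
using StatsBase: wsample!

struct Ensemble{L}
    components::Vector{L}     # fitted learners
    weights::Vector{Float64}  # combination weights
end

# `fitlearner(X, y)` is user-supplied (it might wrap an MLJ machine); here the
# Sample step is a weighted bootstrap and X is assumed matrix-like for brevity.
function samplefitcombine(fitlearner, X, y; K=10, w=fill(1/length(y), length(y)))
    n = length(y)
    idx = Vector{Int}(undef, n)
    components = map(1:K) do _
        wsample!(1:n, w, idx)           # Sample: draw n indices with weights w
        fitlearner(X[idx, :], y[idx])   # Fit: train a learner on the resample
    end
    Ensemble(components, fill(1/K, K))  # Combine: uniform weights here
end
```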

@ablaom (Member, Author) commented Nov 28, 2019

@JockLawrie Continuing the EnsembleModel discussion at #363

@ablaom (Member, Author) commented Dec 13, 2019

Returning to the proposed re-organization:

Based on the discussion here and elsewhere, I think the following is not too controversial: pulling the composite model API down to MLJBase makes sense, because implementers of the MLJ model interface may want to include versions of their models wrapped in pre-transformations of the inputs and target, for example to provide versions that handle mixed data types (as sketched after the list below), or to get creative in other ways. So let's start with that:

  • Move machines.jl, networks.jl, composite.jl, pipelines.jl, arrows.jl from MLJ to MLJBase
  • Move resampling.jl to MLJBase
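
A sketch of the kind of wrapped model in question, using MLJ's learning-network syntax (the atomic model is illustrative, and the composition API details may shift with this re-organization):

```julia
using MLJ

Xs = source(X)    # X: a table with mixed scitypes; y: the target vector
ys = source(y)

hot  = machine(OneHotEncoder(), Xs)
W    = transform(hot, Xs)             # pre-transform the input
tree = @load DecisionTreeRegressor    # requires DecisionTree.jl
reg  = machine(tree, W, ys)
yhat = predict(reg, W)

fit!(yhat)        # trains the encoder, then the regressor, in order
```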

@ablaom (Member, Author) commented Jan 7, 2020

Related discussion: #417

@ablaom (Member, Author) commented Feb 26, 2020

The table below summarises progress on this issue:

[Image MLJ_stack-8: table summarising progress on the re-organization]

I'm closing this issue now. There are some question marks around MLJMeasures and MLJEnsembles (which do not yet exist, but perhaps should); separate issues can address these.

ablaom closed this as completed on Feb 26, 2020