
Re-organizing the MLJ stack #317

Closed · ablaom opened this issue Nov 4, 2019 · 9 comments

Labels: design discussion (Discussing design issues), MLJ-org

Comments

@ablaom (Member) commented Nov 4, 2019

With auto-merge in place, the pains of rolling out breaking changes in the stack are greatly mitigated and I think the time is ripe for increasing modularisation. If nothing else, we could all benefit from a greater distribution of testing, as this is becoming increasingly painful.

Here are some suggestions to get the discussion rolling.

New repos

  • MLJManual: for the MLJ manual, modelled on the present MLJTutorials. Generation of documentation is a serious slowdown for testing in the present MLJ repo (Move the MLJ manual to new repo MLJManual #316)

  • MLJResampling: for the general resampling algorithm (evaluate!) and in-house resampling strategies (Holdout, CV, etc.); a usage sketch follows this list.

  • MLJTuning: for the general tuning algorithm and in-house tuning strategies (Improve the tuning strategy interface #315).

  • MLJComposition: for the machine and composite model functionality (the two are currently integrated).
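
To make concrete what would move to MLJResampling, here is a minimal usage sketch of the existing evaluate! pattern, assuming a machine `mach` already bound to a model and data, and the rms measure as currently exported:

```julia
using MLJ

# estimate performance under 6-fold cross-validation ...
evaluate!(mach, resampling=CV(nfolds=6), measure=rms)

# ... and under a simple holdout split
evaluate!(mach, resampling=Holdout(fraction_train=0.7), measure=rms)
```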

I suggest that the raw interfaces for tuning strategies and resampling strategies live in MLJBase. Then out-of-house implementations of these strategies need only import MLJBase (importing MLJTuning for testing purposes only).
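
For illustration only, an out-of-house tuning strategy might then look something like the sketch below. TuningStrategy and candidates are placeholders for whatever the raw interface ends up being; neither is existing MLJBase API:

```julia
# Purely hypothetical sketch: `TuningStrategy` stands in for an abstract type
# that MLJBase would own under this proposal.
abstract type TuningStrategy end        # imagine: MLJBase.TuningStrategy

struct RandomSearch <: TuningStrategy   # an out-of-house strategy
    n_candidates::Int
end

# the (placeholder) strategy contract: propose hyperparameter values to try
candidates(strategy::RandomSearch, range) =
    [rand(range) for _ in 1:strategy.n_candidates]

candidates(RandomSearch(3), 1:10)       # e.g. [7, 2, 9]
```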

Dependencies

[Image dependency_tree: proposed dependency graph for the re-organized stack]

A question remains how best to handle a composite model that one wants to add to the registry. If it is defined in an external package, then that package can have MLJComposition as a dependency. If, however, it is defined somewhere in MLJModels, then I guess MLJModels would have to add MLJComposition to its hard dependencies, which would be unfortunate.

Thoughts on any of this, anyone?

cc: @DilumAluthge

ablaom added the design discussion and MLJ-org labels on Nov 4, 2019
@ablaom (Member, Author) commented Nov 4, 2019

Okay, I've had a closer look and can see that separating machines from composition is pretty easy. And the machine functionality is a very small amount of code that could live in MLJBase.jl, simplifying the dependencies further.

[Image dep_tree: simplified dependency graph, with machine functionality in MLJBase]
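
For readers less familiar with the codebase, the machine functionality in question is the basic bind-fit-predict layer. A minimal sketch (model and data choices are illustrative; in some MLJ versions @load returns a model type rather than an instance):

```julia
using MLJ

X, y = @load_boston                  # demo dataset shipped with MLJ
train, test = partition(eachindex(y), 0.7)

tree = @load DecisionTreeRegressor   # requires DecisionTree.jl
mach = machine(tree, X, y)           # a machine binds a model to data
fit!(mach, rows=train)               # train on the training rows only
yhat = predict(mach, rows=test)      # predict on the held-out rows
```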

@juliohm (Contributor) commented Nov 6, 2019

It is wonderful to see this modularisation of the stack ❤️ Thank you for the brainstorm, and for the initiative. I would like to take this opportunity to share my own experience with a similar modularisation in GeoStats.jl.

In my case, I decided to put everything in GeoStatsBase.jl and only lift functionality to separate packages when there was a clear concept that could be used independently from the stack. For example, the Variography.jl package contains tools that can be used standalone without operating directly on the GeoStatsBase.jl types. Similarly, the KrigingEstimators.jl package contains estimators that operate on raw Julia arrays, and so deserves a separate package. All the other functionality, which assumes types defined in GeoStatsBase.jl, lives there.

The downside of modularising a project that way is that GeoStatsBase.jl can get very big. Happily, we have Requires.jl nowadays to solve the big-dependency issue for packages that just want to define new learning models. The main benefit of this approach is really on the dev side: the release process is smoother, and everyone is touching the same codebase in the same Git repo.
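
For anyone unfamiliar with that pattern, Requires.jl conditionally loads glue code only when the optional dependency is present. A minimal sketch (the UUID is JSON.jl's, borrowed from the Requires README; the glue file name is illustrative):

```julia
using Requires

function __init__()
    # runs at package load time; includes the glue code only if JSON.jl is loaded
    @require JSON="682c06a0-de6a-54ab-a142-c8b1cf79cde6" include("json_glue.jl")
end
```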

The diagram brainstormed above makes a lot of sense to me. I would only point out that some of these modules do not work standalone. For example, MLJTuning.jl assumes learning models, so maybe it could be absorbed into MLJBase.jl to avoid a hard split in the codebase. On the other hand, something like LossFunctions.jl could live in a separate package because loss functions are simple Julia functions (perhaps depending only on ScientificTypes.jl). The decision between creating a separate package and adding a submodule to MLJBase.jl could be based on criteria like these.

Regardless of the final modularisation, we will be able to collaborate much more now that the implementations and names are self-contained in smaller packages as opposed to the umbrella MLJ.jl. My personal view is that MLJ.jl is a curated combination of machine learning packages. Similarly, GeoStats.jl is a curated combination of packages for spatial problems. And many other packages could be developed in the future in a modular way, reusing parts of the learning stack (e.g. TimeSeries.jl, CoDa.jl).

@JockLawrie commented

As another data point, I've been training MixtureModels in a custom way for years and I'm having trouble training them using the MLJ interface.

However, I can get the component distributions to satisfy the MLJ interface. Then I use a custom sample-fit-combine approach to construct the ensemble. See this SampleFitCombine.jl package (unregistered) for details and examples.

I'd love to be able to do this using MLJ, which would remove the need for SampleFitCombine and avoid cluttering the ecosystem. Can you see a way to do this?

@ablaom (Member, Author) commented Nov 25, 2019

Thanks for your query.

I'm guessing that MLJ's EnsembleModel wrapper provides at least some of the functionality of SampleFitCombine (the bagging). Is this what you're after?

Happy to discuss extensions. Contributions welcome.

See https://alan-turing-institute.github.io/MLJ.jl/dev/homogeneous_ensembles/ (manual) and https://alan-turing-institute.github.io/MLJTutorials/pub/getting-started/ensembles-2.html (tutorial)
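
For orientation, a minimal bagging sketch with EnsembleModel, assuming an atomic regression model tree and data X, y are already in scope (keyword names here follow the linked manual and may differ in later MLJ versions):

```julia
using MLJ

# 100 copies of `tree`, each trained on an 80% bootstrap resample
forest = EnsembleModel(atom=tree, n=100, bagging_fraction=0.8)
mach = machine(forest, X, y)
evaluate!(mach, resampling=CV(nfolds=6), measure=rms)
```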


@JockLawrie commented

Yes, the homogeneous ensembles provide some of the functionality. I think this could be more general, along the lines of SampleFitCombine, which also allows other ways to sample training data and combine learners.

For example, see the test case, which partitions the training data by clustering the points according to cdf(predict(ensemble, training_data), training_data_point). This is the Sample step. A learner is then fitted to each cluster of data points in the partition (Fit), and the learners are combined with weights that minimize the loss (Combine).

I'm happy to try to contribute, but at this point I think it better to first flesh out what this would look like. SampleFitCombine is my proposal, namely (a rough sketch of the core loop appears at the end of this comment):

  1. The Ensemble struct
  2. Sample methods constructed independently, using wsample! in StatsBase for example.
  3. Fit methods are already provided by MLJ.
  4. Three combine methods (uniform, pre-specified, optimal)
  5. predict(ensemble, Xtest)
  6. loss(ensemble, X, y, lossfunction)
  7. Conveniences

The Ensemble struct in SampleFitCombine serves the same purpose as MLJ's EnsembleModel. However, the latter seems to assume that bagging is the primary (or only) mechanism for constructing ensembles. But training data may be sampled in many ways, not just via bagging, and learners may be fit independently or in sequence. E.g., boosting can be expressed as a sequential sample-fit-combine method.
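
To make this concrete, here is a rough, purely illustrative sketch of the core loop; none of the names below are taken from SampleFitCombine.jl itself:

```julia
using StatsBase: wsample!

struct Ensemble{L}
    components::Vector{L}     # fitted learners
    weights::Vector{Float64}  # combination weights
end

# `fitlearner(X, y)` is user-supplied (it might wrap an MLJ machine); here the
# Sample step is a weighted bootstrap and X is assumed matrix-like for brevity.
function samplefitcombine(fitlearner, X, y; K=10, w=fill(1/length(y), length(y)))
    n = length(y)
    idx = Vector{Int}(undef, n)
    components = map(1:K) do _
        wsample!(1:n, w, idx)           # Sample: draw n indices with weights w
        fitlearner(X[idx, :], y[idx])   # Fit: train a learner on the resample
    end
    Ensemble(components, fill(1/K, K))  # Combine: uniform weights here
end
```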

@ablaom (Member, Author) commented Nov 28, 2019

@JockLawrie Continuing the EnsembleModel discussion at #363

@ablaom (Member, Author) commented Dec 13, 2019

Returning to the proposed re-organization:

Based on the discussion here and elsewhere, I think the following is not too controversial: pulling the composite model API down to MLJBase makes sense, because implementers of the MLJ model interface may want to include versions of their models wrapped in pre-transformations of the inputs and target, for example to provide versions that handle mixed data types (as sketched after the list below), or to get creative in other ways. So let's start with that:

  • Move machines.jl, networks.jl, composite.jl, pipelines.jl, arrows.jl from MLJ to MLJBase
  • Move resampling.jl to MLJBase
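
A sketch of the kind of wrapped model in question, using MLJ's learning-network syntax (the atomic model is illustrative, and the composition API details may shift with this re-organization):

```julia
using MLJ

Xs = source(X)    # X: a table with mixed scitypes; y: the target vector
ys = source(y)

hot  = machine(OneHotEncoder(), Xs)
W    = transform(hot, Xs)             # pre-transform the input
tree = @load DecisionTreeRegressor    # requires DecisionTree.jl
reg  = machine(tree, W, ys)
yhat = predict(reg, W)

fit!(yhat)        # trains the encoder, then the regressor, in order
```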

@ablaom (Member, Author) commented Jan 7, 2020

Related discussion: #417

@ablaom (Member, Author) commented Feb 26, 2020

The table below summarises progress on this issue:

[Image MLJ_stack-8: table summarising progress on the re-organization]

I'm closing this issue now. There are some question marks around MLJMeasures and MLJEnsembles (which do not yet exist, but perhaps should); separate issues can address these.

ablaom closed this as completed on Feb 26, 2020