Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disintegration of MLJModels #244

Closed
ablaom opened this issue Apr 29, 2020 · 15 comments
Closed

Disintegration of MLJModels #244

ablaom opened this issue Apr 29, 2020 · 15 comments

Comments

@ablaom
Copy link
Member

ablaom commented Apr 29, 2020

Maintenance of MLJModels has become increasingly burdensome for several reasons. Perhaps the biggest problem is that it's centralised approach to providing model API implementations ("glue code") means:

  • Testing takes a long time. Tolerable, just.

  • [extras] must include a very large number of packages which invariably cause fatal version conflicts with the packges in [deps] during CI. (As I understand it, during a test, the [deps] are essentially pinned when [extras] are loaded.) Less tolerable.

  • With the existing package manager, we have no way to specify bounds on the algorithm-providing packages (the ones in [extras]). The latest release compatible with [deps] always get's loaded. If just one (of these many) packages makes a breaking change to the "glue code", then MLJModels CI fails.

While JuliaLang/Pkg.jl#1285 may help with the second and third issue, I don't think that is close to being resolved. We have also observed elsewhere that code loading using Requires.jl can be slower than otherwise. (And there is #243)

While the plan has always been for all algorithm-providing packages to implement their MLJ interfaces natively, this is not going to happen quickly. In the meantime, it would be good to address the issues above.

In discussions of the core team, @tlienart has suggested the following remedy: Move the glue code for each package X into its own repository Xglue, with its own testing, and make X an ordinary dependency of Xglue. (The package Xglue would be purely a "utility" package and essentially invisible to general users.)

Such migrations could be performed incrementally and I believe each migration would trigger only a patch release, basically because the model registry (which tracks where to find glue code) is part of MLJModels (and so is opaque to MLJ).

I think this a good idea and propose beginning this disintegation; See TODO list.
cc @DilumAluthge

@ablaom
Copy link
Member Author

ablaom commented Apr 29, 2020

See also @tlienart 's earlier post at JuliaAI/MLJ.jl#276

@tlienart
Copy link
Collaborator

tlienart commented Apr 29, 2020

Great, and just to stress it, we don't want this to be the rule, we want this to be a temporary solution and hope that the packages for which we end up writing glue code will integrate and own said glue code (we understand that some of these are slow moving packages with a lot of legacy and may be slow integrating stuff...)

We definitely do not want other package devs with brand new cool packages to suggest a glue package. They should just own the interface like ParallelKMeans, EvoTrees or MLJLinearModels

@DilumAluthge
Copy link
Member

FWIW, we now have support for storing multiple registered Julia packages inside subdirectories a single Git repository.

So e.g. if I already have a Git repository for my package MyPackage.jl, I can now make a subdirectory inside that repository, and inside that subdirectory I can store a package called MyPackageGlue.jl.

@DilumAluthge
Copy link
Member

FWIW, we now have support for storing multiple registered Julia packages inside subdirectories a single Git repository.

So e.g. if I already have a Git repository for my package MyPackage.jl, I can now make a subdirectory inside that repository, and inside that subdirectory I can store a package called MyPackageGlue.jl.

This works exactly the same as if MyPackage.jl and MyPackageGlue.jl are stored in different Git repositories. But I figure that at least some people will find it easier/more convenient to store MyPackage.jl and MyPackageGlue.jl in the same Git repository.

@tlienart
Copy link
Collaborator

that's great! though here I think that having separate repos has the advantage that when eventually the original package (e.g. MultivariateStats) owns the glu code, we can just archive the repo whereas here you'd have to remove a subrepos which might be messier to maintain overall (?)

@DilumAluthge
Copy link
Member

So it's not a subrepo or anything fancy. It's just a regular folder. So if you want to move the glue code to the main package, just rename and move the files.

@DilumAluthge
Copy link
Member

DilumAluthge commented Apr 29, 2020

Does that explanation make sense?

Here's an example repo with two packages: https://github.com/DilumAluthge/FooBar

There's nothing fancy, just regular folders. If you want to move the Bar package code into the Foo package code, just move/rename the files. If you want to delete the Bar package entirely, just delete the files.

@ablaom
Copy link
Member Author

ablaom commented Apr 29, 2020

@DilumAluthge How do release notes work in a Repo hosting multiple packages?

@DilumAluthge
Copy link
Member

The idea is that you would make tags that look like this:

  • PkgA-v1.2.3
  • PkgB-v4.4.6

In the appropriate tag, you have release notes for the relevant package only.

@DilumAluthge
Copy link
Member

You can have any system you want of course. That is just a suggestion.

@tlienart
Copy link
Collaborator

tlienart commented Apr 30, 2020

I think this is cool but I also think that in this particular case having separate repos is simple & would help see clearly where the code ought to be & the git history, I have a feeling this will cause more headaches especially for new maintainers.

Our end goal here is that this code ends up outside MLJModels and is integrated in other packages so I think having subfolders (I understand it's not subrepos) just does not seem to be the right approach here even if it's fully supported by Pkg.

@DilumAluthge
Copy link
Member

Yeah, given the long term goal here, I think it makes sense to keep these "glue" packages in separate Git repositories.

@ablaom
Copy link
Member Author

ablaom commented Jun 10, 2020

Packages with implementation code to migrate out to separate package:

  • ScikitLearn
  • MultivariateStats
  • DecisionTree
  • GLM
  • LIBSVM
  • NaiveBayes
  • NearestNeighbors (In progress)
  • Clustering
  • XGBoost*

Those marked with * are candidates for algorithm-providing package to host model implementation code.

Others get migrated to new package called "MLJPackageInterface".

Important

The package_name trait should stay the same but the load_path traits will need to reflect new location of the model implementation code. For example:

name = "Birch",
package_name = "ScikitLearn",
load_path = "MLJScikitLearnInterface.Birch"

cc @tlienart

@DilumAluthge
Copy link
Member

The MLJ.jl README has a very nice visualization of the dependency graph, stored in the repo at https://github.com/alan-turing-institute/MLJ.jl/blob/master/material/MLJ_stack.svg

As we break up MLJModels and MLJBase, we should definitely remember to keep this image up-to-date.

@ablaom
Copy link
Member Author

ablaom commented Feb 8, 2021

All done 🎉

Thanks to all who helped in this large project, especially @OkonSamuel @tlienart and @ExpandingMan

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants