
Integrate flux models #33

Closed
ysimillides opened this issue Dec 11, 2018 · 21 comments

Labels: design discussion (Discussing design issues) · enhancement (New feature or request)

@ysimillides (Collaborator)

Would be good to have some Flux integration.

@ysimillides (Collaborator, Author)

@ayush1999 this might interest you, alongside @sjvollmer

ysimillides added the help wanted (Extra attention is needed) label Dec 11, 2018
@ayush1999 (Contributor) commented Dec 12, 2018

@ysimillides Definitely interested. I opened an issue regarding this: #19. Looks like things have changed since then, right?

@ablaom (Collaborator) commented Dec 12, 2018

Things have changed but the API for external packages has stabilised. See here for the spec.

You may also want to look at KoalaFlux, which runs under Koala. A nice feature there is that categorical features are handled through learned feature embeddings. One can then export the learned embeddings as a pre-transformation for other models that don't handle categoricals. However, this is not something to incorporate in a first implementation, unless you decide to just port the Koala code (which I may have a go at if I get time).

The key design question is how to encode the desired neural network architecture as hyperparameters of the MLJ model. In KoalaFlux the model gets a hyperparameter network_creator, a function mapping an integer (the number of input features) to a Flux.Chain object (see the KoalaFlux test code). This requires that the user be familiar with building a Flux chain, so it may not be ideal as a final solution. While I think this is fine in the short term, I welcome suggestions for a better way.
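
To illustrate, a minimal sketch of such a function might look like this (layer sizes arbitrary; see the KoalaFlux test code for the real pattern):

using Flux

# Illustrative network_creator: maps the number of input features to a
# Flux.Chain with one hidden layer and a single (regression) output:
network_creator(n_inputs) = Chain(
    Dense(n_inputs, 20, relu),  # hidden layer; width 20 chosen arbitrarily
    Dense(20, 1))               # single output node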

Also, the optimiser should be a hyperparameter, which I did not get around to implementing (I just used momentum).

Warning: "model" in MLJ terminology != "model" in Flux terminology. In MLJ a "model" is the name for a container of hyper-parameters.
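
Schematically, then, a wrapped-Flux model might be a struct like this (field names and supertype purely illustrative):

abstract type Supervised end  # stand-in for the actual MLJ model supertype

# An MLJ "model" holds hyperparameters only; the learned network weights
# live in the fit-result returned by fit, not in the struct:
mutable struct FluxRegressor <: Supervised
    network_creator::Function  # Int -> Flux.Chain
    optimiser                  # e.g. Flux.Momentum(...)
    epochs::Int                # number of training epochs
end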

@fkiraly (Collaborator) commented Dec 12, 2018 via email

@ablaom (Collaborator) commented Dec 12, 2018

@fkiraly

Yes, yes, in my hasty revisions some things have gotten muddled. Mea culpa.

  • multivariate + nominal should not be the same as multi-class! It should be the same as multi-target - not multi-class!

This is a mistake: "multiclass" means more than two classes for some target (there is usually just one target); "multivariate" means multiple targets.

  • on a similar note: you want to think very carefully about what the (back-end and front-end) return type of “probabilistic + multivariate” should be…

What do you suggest?

  • operations: should predict_proba not simply be predict if the model is probabilistic? Otherwise we're schizophrenic about whether the metadata describes a single thing (e.g., probabilistic binary classification), or multiple things the model can potentially do (deterministic binary classification, and probabilistic binary classification, and perhaps other things as well). Some thought might be appropriate here so as not to muddle the metadata description.

In my view a model might do multiple things, and I agree that I have muddled the metadata description. I suggest returning to the earlier formulation by dumping "probabilistic" as a descriptor of outputs. The "operations" key tells you if a probabilistic prediction is possible (by including "predict_proba" as a value, in addition to "predict"). See the new "adding_new_models.md" just pushed.

  • would vote for an additional optional method, "fitpredict", which does fitting and prediction in one step. Can add this later though.

Sure. That can be achieved trivially (and simultaneously for all models) at the level of the "machine interface", and is not needed in the model interface.
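
Schematically, something like this, defined once and working for every model (names hypothetical):

# A generic fitpredict at the machine level: fit the machine, then
# predict on the supplied data, in a single call:
function fitpredict!(mach, Xnew)
    fit!(mach)
    predict(mach, Xnew)
end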

Regarding @fkiraly's comments on NNs: I agree, we should allow the user to interact with Flux at different levels of abstraction. As I say, to start with let's implement the low-level variety, i.e., the user essentially specifies the whole code (which in Flux is not that bad) that generates the architecture, given some information about the data (say, the number of inputs).

Okay, I'd best be off to the airport.

@fkiraly (Collaborator) commented Dec 12, 2018

on a similar note: you want to think very carefully about what the (back-end and front-end) return type of “probabilistic + multivariate” should be…

What do you suggest?

Answered this in #34.

In my view a model might do multiple things, and I agree that I have muddled the metadata description. I suggest returning to the earlier formulation by dumping "probabilistic" as a descriptor of outputs.

Would suggest against doing this!
In my opinion this further muddles the interface, by trying to infer what a struct does from the names of the methods it has.
"Secret knowledge" is not a good API design principle...

@ablaom (Collaborator) commented Dec 14, 2018

I don't see that the knowledge is "secret" since the information goes into the metadata. That is, if predict_proba is implemented for a model DecisionTreeClassifier, then I will have

metadata(DecisionTreeClassifier)["operations"] = ["predict", "predict_proba"]

That said, I would like to understand the alternative you are suggesting. Are you suggesting that, instead of ONE model DecisionTreeClassifier with TWO methods predict and predict_proba, we instead have TWO models DecisionTreeClassifier and DecisionTreeClassifierProbabilistic, each with a UNIQUE predict method, and write

metadata(DecisionTreeClassifier)["outputs_are"] = ["nominal", "multiclass"]
metadata(DecisionTreeClassifierProbabilistic)["outputs_are"] = ["nominal", "multiclass", "probabilistic"]

?

If so, what are the advantages, apart from being a little more explicit about purpose? Some disadvantages that I see are:

  • More models means more code duplication, or some other complication like type parameters or macros.
  • If I want both types of prediction, I have to train two separate models, or introduce a mechanism to convert one prediction type into another.
  • An incongruence with the design elsewhere: for transformers (e.g., standardizers) it makes far less sense to split a model into two - one for transforming and one for inverse-transforming - because I will frequently want to use both methods, and they share the same fit-result (and I cannot just use the output of one method to get the output of the other, as in the case of classifier predictions).

If I have misunderstood your suggestion, can you please explain your alternative in more detail?

@fkiraly (Collaborator) commented Dec 15, 2018

I think it would be cleaner if you have multiple models.

Unless the interface is able to explicitly tell you that DecisionTreeClassifier can do both (i.e., can have multiple metadata entries), this will imply the convention that all models with the probabilistic flag have to implement both the probabilistic and the deterministic variant.

In the clean world, getting the deterministic prediction by thresholding is natural through attaching a target transformer, which introduces a "threshold" hyper-parameter.

If you want both types of prediction, the interface could recognize that you're asking the probabilistic model for a deterministic prediction, and automatically convert by applying, say, the 1/2 thresholder (which is not always the best solution regarding misclassification rate! A threshold trained on the training data may be better). In such a case, only one method has to be defined.
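
Schematically, I mean something like this (assuming binary classification and a predict_proba returning positive-class probabilities):

# The deterministic prediction derived from the probabilistic one by
# thresholding; "threshold" becomes an ordinary, tunable hyper-parameter:
function threshold_predict(model, fitresult, Xnew; threshold=0.5)
    p = predict_proba(model, fitresult, Xnew)
    [pi >= threshold ? :positive : :negative for pi in p]
end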

The obvious problem with your solution is that you force users to provide the deterministic predict functionality, and this is usually done by 1/2 thresholding. The fact that a choice is being made here is swept entirely under the rug, and it may force users to make a choice that is unnatural.

The last problem, about design incongruence, I do not see: I do not think this would imply you have to split transformers. For classifiers we are talking about tasks, i.e., what the model does; the classical classifier design solves two tasks in one. Whereas for a target transformer, transforming the input and transforming the output are parts of solving a single task, so this perceived incongruence rather looks like a category error to me, sorry.

@ablaom (Collaborator) commented Mar 18, 2019

Returning to the original issue: I am rethinking the way in which deep learning is to be integrated with MLJ.

I think we can have a more seamless integration of deep learning and the other paradigms once we realise that, given a general gradient-descent tuning strategy (for suitable hyperparameters of pure-Julia algorithms), our learning networks (exported as models) are essentially generalisations of neural networks. Such a tuning strategy would allow tuning of the (possibly nested) hyperparameters of these "generalized networks". To incorporate component models that are standard neural network architectures, we simply wrap them as models in which the network weights (in a Flux chain, for example) are declared as hyperparameters rather than learned parameters. Since these parameters completely determine the model, fit for such models essentially does nothing, and the training of the weights is externalised to MLJ (which would integrate the SGD optimiser and its variants as a way of tuning the hyperparameters of any model).

I admit this is a bit confusing at first. Is any of this making sense to others?

@fkiraly (Collaborator) commented Mar 18, 2019

Um, wouldn't this mean replicating all the features of Flux and, at the same time, generalizing them?

This sounds like a project as large as MLJ itself... I like the idea, since it would allow you to build "learning networks" for arbitrarily specified (and arbitrarily complex) input/output combinations.

However, it also seems very ambitious given the current development team.
Maybe it would be helpful to get the Flux developers' opinion on this?

And in the interim, an integration which is seamless only at the interface level (rather than down to the full model specification) might be the way to go?

Also, in general, the neural-network-specific syntax of Flux is very helpful for building neural network architectures, or for retrieving default architectures. I don't think one would enjoy building a deep neural network by manually stitching together layers of GLM...

@datnamer commented Mar 18, 2019

Makes sense to me theoretically. Implementation feasibility aside, this was the insight behind the original Julia ML http://www.breloff.com/transformations/ where nodes/layers would be any transformation, including traditional learning algorithms.

Edit: On the other hand, Flux isn't supposed to be just a bunch of layers... with the whole differentiable programming paradigm, it seems like MLJ is actually a subset of what can be expressed with Flux. https://fluxml.ai/2019/02/07/what-is-differentiable-programming.html

@fkiraly (Collaborator) commented Mar 18, 2019

@datnamer - why do you think MLJ is a subset of Flux? I don't think the expressibility of one, in terms of modelling, is currently a subset of the other.

Of course both are Turing complete since you can write Julia in them, but I assume you mean this at the level of interface or composite construction?

Regarding transformations: yes, I think this is the right idea.
Though not every algorithm is an instance of "fitting the parameters by (regularised) gradient descent" - which seems to be a common misunderstanding of the deep learning age?

@fkiraly (Collaborator) commented Mar 18, 2019

On a side note, did Breloff leave any design documents behind for transformations? Or, is there a paper? And, is he still actively developing?

@ablaom (Collaborator) commented Mar 18, 2019

@fkiraly

I don't think one would enjoy building a deep neural network by manually stitching together layers of GLM...

No, no. We don't duplicate the Flux syntax. You define a Flux chain using their nice syntax. Then you have a standard wrapper for such objects, allowing you to slot them in as component models in an MLJ "learning network" (which might include non-NN components). Only the training of the NN gets externalised to MLJ (by declaring the neural net weights as model hyperparameters, instead of regarding them as parameters to be learned by calling fit), not its specification.

So syntax might look something like:

transformer = FluxWrapper(chain=Dense(100, 10, Flux.σ)) # dimension reducer
regressor = LightGBM(alpha=0.1) # a non-neural network model
composite = @pipeline transformer regressor

If you wrap composite in an SGD tuning strategy, and specify transformer.chain and regressor.alpha as the hyperparameters to be tuned, then fitting the tuned-model wrap will simultaneously train the weights of the NN and tune the regularisation parameter alpha of the regressor (assuming LightGBM is written in Julia, and so tunable by SGD). For hyperparameters that cannot be tuned by SGD (e.g., regressor.max_depth), you do separate tuning wraps (with different tuning strategies, such as grid search). As at present, these wraps could either be local (for tuning models individually) or global (to tune parameters in multiple component models simultaneously). Also, if we want a Flux model to present more "conventionally", we could wrap it locally in an SGD tuning strategy, but there are times you might not want to do this.
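
Continuing the sketch above, the tuning wrap might look something like this (TunedModel, GradientDescent and the params keyword are placeholders here - none of this machinery exists yet):

# Purely speculative sketch of an SGD tuning wrap:
tuned_composite = TunedModel(model=composite,
                             tuning=GradientDescent(),      # hypothetical SGD strategy
                             params=[:(transformer.chain),  # nn weights, as hyperparameters
                                     :(regressor.alpha)])   # regularisation parameter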

@fkiraly (Collaborator) commented Mar 18, 2019

Ah, I didn't think of that! You have a tuning wrapper which can generically tune SGD-fittable parameters and would fully interface with a composite model within? That would be genius (if it works).

Also very interesting, since it looks like an instance of separating a "fitting strategy" from the "model specification" at the level of composites - though I don't quite see how that could/should look generically.

@fkiraly (Collaborator) commented Mar 18, 2019

Regarding the last bit, I remember an instance of this: Bayesians occasionally do it within the "probabilistic programming" paradigm. Though, of course, that only supports Bayesian-style fitting...

@ablaom (Collaborator) commented Mar 19, 2019

You have a tuning wrapper which can generically tune SGD-fittable parameters

Well, we don't have it yet, but getting this working has been, I understand, a goal that predates my involvement in MLJ. I think @tlienart already had a go at this for some restricted class of models. Flux already provides an AD module we can use. The problem I see is that the parameters to be tuned (i.e., those with respect to which we want to differentiate) must be wrapped as "tracked" arrays, and I don't see how we can do this from outside the model (whose hyperparameter types are fixed). However, I understand @MikeInnes has been working on changes to the AD engine that might make this easier. Perhaps he can clarify.

So, yes, this is still somewhat speculative. The main point I want to raise is that we should not be in a hurry to integrate neural networks in the naive way (fitted in isolation), if there is a more elegant solution around the corner.

@fkiraly (Collaborator) commented Mar 19, 2019

@ablaom yes, by "you have" I meant, of course, "in the context of the plan/design which is not yet implemented".

Though I disagree with the conclusion: as you say, the full "learning network" design is somewhat speculative, and would be one of a kind (so who knows whether in the end it's brilliant or just a curiosity). Whereas integration-by-interface, e.g., a simple wrapper for a Flux architecture specification, is not a lot of work, and the design is obvious.

Rephrasing: I wouldn't avoid integrating an important model class entirely just because there's highly interesting (but risky) research to be done on it - especially since the plain integration seems to be a quick (but somewhat boring) job? Maybe there are volunteers...

@datnamer commented Mar 19, 2019

@fkiraly :

Of course both are Turing complete since you can write Julia in them, but I assume you mean this at the level of interface or composite construction?

Yes, that's what I meant. Flux models can soon be just functions, as I think even the need for layers might be going away: https://github.com/FluxML/model-zoo/blob/notebooks/other/flux-next/intro.ipynb (@ablaom, you can take a look at that notebook to see where Flux is going. The boilerplate for models is going to be reduced, and tracker types will no longer be needed once Zygote.jl is ready for prime time. More info on the AD and compiler stuff here: https://drive.google.com/file/d/1tK4n3qQ5YsJkLc-8FEw5JMa90gHHfh3i/edit)

On a side note, did Breloff leave any design documents behind for transformations? Or, is there a paper? And, is he still actively developing?

You can find some discussion here: https://github.com/JuliaML/Roadmap.jl/issues and on the blog I linked. @Evizero might know more. I don't think @tbreloff is still working on Julia open source.

@fkiraly (Collaborator) commented Mar 19, 2019

@datnamer thanks - Roadmap.jl doesn't look like it has a full set of design or org documents; most issues read more like an eclectic feature wishlist? There are also partial designs which seem to focus on optimization-based machine learning methods (?). A number of the thoughts might be useful.

The fate of Roadmap.jl also makes me think, @ablaom - perhaps at some point we should write up the key design decisions for the benefit of future generations, just in case we all get run over by a bus or something.

ablaom added the design discussion (Discussing design issues) label and removed the help wanted (Extra attention is needed) label May 23, 2019
@ablaom (Collaborator) commented Jun 22, 2020

Registration of the new MLJFlux models with the MLJ model registry: JuliaRegistries/General#16728

julia> using MLJModels

julia> models("Flux")
4-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = ImageClassifier, package_name = MLJFlux, ... )                  
 (name = MultitargetNeuralNetworkRegressor, package_name = MLJFlux, ... )
 (name = NeuralNetworkClassifier, package_name = MLJFlux, ... )          
 (name = NeuralNetworkRegressor, package_name = MLJFlux, ... )          
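
Any of these can then be loaded in the usual way, e.g. with something like:

using MLJ
clf = @load NeuralNetworkClassifier pkg=MLJFlux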

ablaom closed this as completed Jun 22, 2020