
Integrate flux models #33

Closed
ysimillides opened this issue Dec 11, 2018 · 21 comments

Labels: design discussion (Discussing design issues) · enhancement (New feature or request)

@ysimillides (Collaborator)

Would be good to have some Flux integration.

@ysimillides (Collaborator, Author)

@ayush1999 this might interest you, alongside @sjvollmer

ysimillides added the help wanted (Extra attention is needed) label Dec 11, 2018
@ayush1999 (Contributor) commented Dec 12, 2018

@ysimillides Definitely interested. I opened an issue regarding this: #19. Looks like things have changed since then, right?

@ablaom (Collaborator) commented Dec 12, 2018

Things have changed but the API for external packages has stabilised. See here for the spec.

You may also want to look at KoalaFlux, which runs under Koala. A nice feature there is that categorical features are handled through learned feature embeddings. One can then export the learned embeddings as a pre-transformation for other models that don't handle categoricals. However, this is not something to incorporate in a first implementation, unless you decide to just port the Koala code (which I may have a go at if I get time).

The key design question is how to encode the desired neural network architecture as hyperparameters of the MLJ model. In KoalaFlux the model gets a hyperparameter network_creator, a function mapping an integer (the number of input features) to a Flux.Chain object (see the KoalaFlux test code). This requires that the user be familiar with building a Flux chain, so it may not be ideal as a final solution. While I think this is fine in the short term, I welcome suggestions for a better way.
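
To illustrate, a minimal sketch of such a function might look like this (layer sizes arbitrary; see the KoalaFlux test code for the real pattern):

using Flux

# Illustrative network_creator: maps the number of input features to a
# Flux.Chain with one hidden layer and a single (regression) output:
network_creator(n_inputs) = Chain(
    Dense(n_inputs, 20, relu),  # hidden layer; width 20 chosen arbitrarily
    Dense(20, 1))               # single output node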

Also, the optimiser should be a hyperparameter, which I did not get around to implementing (I just used momentum).

Warning: "model" in MLJ terminology != "model" in Flux terminology. In MLJ a "model" is the name for a container of hyper-parameters.
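
Schematically, then, a wrapped-Flux model might be a struct like this (field names and supertype purely illustrative):

abstract type Supervised end  # stand-in for the actual MLJ model supertype

# An MLJ "model" holds hyperparameters only; the learned network weights
# live in the fit-result returned by fit, not in the struct:
mutable struct FluxRegressor <: Supervised
    network_creator::Function  # Int -> Flux.Chain
    optimiser                  # e.g. Flux.Momentum(...)
    epochs::Int                # number of training epochs
end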

@fkiraly (Collaborator) commented Dec 12, 2018 via email

@ablaom (Collaborator) commented Dec 12, 2018

@fkiraly

Yes, yes, in my hasty revisions some things have gotten muddled. Mea culpa.

  • multivariate + nominal should not be the same as multi-class! It should be the same as multi-target - not multi-class!

This is a mistake: "multiclass" means more than two classes for some target (there is usually just one target); "multivariate" means multiple targets.

  • on a similar note: you want to think very carefully about what the (back-end and front-end) return type of “probabilistic + multivariate” should be…

What do you suggest?

  • operations: should predict_proba not simply be predict if the model is probabilistic? Otherwise we're schizophrenic about whether the metadata describes a single thing (e.g., probabilistic binary classification), or multiple things the model can potentially do (deterministic binary classification, and probabilistic binary classification, and perhaps other things as well). Some thought might be appropriate here so as not to muddle the metadata description.

In my view a model might do multiple things, and I agree that I have muddled the metadata description. I suggest returning to the earlier formulation by dumping "probabilistic" as a descriptor of outputs. The "operations" key tells you if a probabilistic prediction is possible (by including "predict_proba" as a value, in addition to "predict"). See the new "adding_new_models.md" just pushed.

  • would vote for an additional optional method, "fitpredict", which does fitting and prediction in one step. Can add this later though.

Sure. That can be achieved trivially (and simultaneously for all models) at the level of the "machine interface", and is not needed in the model interface.
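
Schematically, something like this, defined once and working for every model (names hypothetical):

# A generic fitpredict at the machine level: fit the machine, then
# predict on the supplied data, in a single call:
function fitpredict!(mach, Xnew)
    fit!(mach)
    predict(mach, Xnew)
end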

Regarding @fkiraly's comments on NNs: I agree, we should allow the user to interact with Flux at different levels of abstraction. As I say, to start with let's implement the low-level variety, i.e., the user essentially specifies the whole code (which in Flux is not that bad) that generates the architecture, given some information about the data (say, the number of inputs).

Okay, I'd best be off to the airport.

@fkiraly (Collaborator) commented Dec 12, 2018

on a similar note: you want to think very carefully about what the (back-end and front-end) return type of “probabilistic + multivariate” should be…

What do you suggest?

Answered this in #34.

In my view a model might do multiple things, and I agree that I have muddled the metadata description. I suggest returning to the earlier formulation by dumping "probabilistic" as a descriptor of outputs.

Would suggest against doing this!
In my opinion this further muddles the interface, by trying to infer what a struct does from the names of the methods it has.
"Secret knowledge" is not a good API design principle...

@ablaom (Collaborator) commented Dec 14, 2018

I don't see that the knowledge is "secret" since the information goes into the metadata. That is, if predict_proba is implemented for a model DecisionTreeClassifier, then I will have

metadata(DecisionTreeClassifier)["operations"] = ["predict", "predict_proba"]

That said, I would like to understand the alternative you are suggesting. Are you suggesting that, instead of ONE model DecisionTreeClassifier with TWO methods predict and predict_proba, we instead have TWO models DecisionTreeClassifier and DecisionTreeClassifierProbabilistic, each with a UNIQUE predict method, and write

metadata(DecisionTreeClassifier)["outputs_are"] = ["nominal", "multiclass"]
metadata(DecisionTreeClassifierProbabilistic)["outputs_are"] = ["nominal", "multiclass", "probabilistic"]

?

If so, what are the advantages, apart from being a little more explicit about purpose? Some disadvantages that I see are:

  • More models means more code duplication, or some other complication like type parameters or macros.
  • If I want both types of prediction, I have to train two separate models, or introduce a mechanism to convert one prediction type into another.
  • An incongruence with the design elsewhere: for transformers (e.g., standardizers) it makes far less sense to split a model into two - one for transforming and one for inverse-transforming - because I will frequently want to use both methods, and they share the same fit-result (and I cannot just use the output of one method to get the output of the other, as in the case of classifier predictions).

If I have misunderstood your suggestion, can you please explain your alternative in more detail?

@fkiraly (Collaborator) commented Dec 15, 2018

I think it would be cleaner if you have multiple models.

Unless the interface is able to explicitly tell you that DecisionTreeClassifier can do both (i.e., can have multiple metadata entries), this will imply the convention that all models with the probabilistic flag have to implement both the probabilistic and the deterministic variant.

In the clean world, getting the deterministic prediction by thresholding is natural through attaching a target transformer, which introduces a "threshold" hyper-parameter.

If you want both types of prediction, the interface could recognize that you're asking the probabilistic model for a deterministic prediction, and automatically convert by applying, say, the 1/2 thresholder (which is not always the best solution regarding misclassification rate! A threshold trained on the training data may be better). In such a case, only one method has to be defined.
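
Schematically, I mean something like this (assuming binary classification and a predict_proba returning positive-class probabilities):

# The deterministic prediction derived from the probabilistic one by
# thresholding; "threshold" becomes an ordinary, tunable hyper-parameter:
function threshold_predict(model, fitresult, Xnew; threshold=0.5)
    p = predict_proba(model, fitresult, Xnew)
    [pi >= threshold ? :positive : :negative for pi in p]
end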

The obvious problem with your solution is that you force users to provide the deterministic predict functionality, and this is usually done by 1/2 thresholding. The fact that a choice is being made here is swept entirely under the rug, and it may force users to make a choice that is unnatural.

The last problem, about design incongruence, I do not see: I do not think this would imply you have to split transformers. For classifiers we are talking about tasks, i.e., what the model does; the classical classifier design solves two tasks in one. Whereas for a target transformer, transforming the input and transforming the output are parts of solving a single task, so this perceived incongruence rather looks like a category error to me, sorry.

@ablaom (Collaborator) commented Mar 18, 2019

Returning to the original issue: I am rethinking the way in which deep learning is to be integrated with MLJ.

I think we can have a more seamless integration of deep learning and the other paradigms once we realise that, given a general gradient-descent tuning strategy (for suitable hyperparameters of pure-Julia algorithms), our learning networks (exported as models) are essentially generalisations of neural networks. Such a tuning strategy would allow tuning of the (possibly nested) hyperparameters of these "generalized networks". To incorporate component models that are standard neural network architectures, we simply wrap them as models in which the network weights (in a Flux chain, for example) are declared as hyperparameters rather than learned parameters. Since these parameters completely determine the model, fit for such models essentially does nothing, and the training of the weights is externalised to MLJ (which would integrate the SGD optimiser and its variants as a way of tuning the hyperparameters of any model).

I admit this is a bit confusing at first. Is any of this making sense to others?

@fkiraly (Collaborator) commented Mar 18, 2019

Um, wouldn't this mean replicating all the features of Flux and, at the same time, generalizing them?

This sounds like a project as large as MLJ itself... I like the idea, since it would allow you to build "learning networks" for arbitrarily specified (and arbitrarily complex) input/output combinations.

However, it also seems very ambitious given the current development team.
Maybe it would be helpful to get the Flux developers' opinion on this?

And in the interim, an integration which is seamless only at the interface level (rather than down to the full model specification) might be the way to go?

Also, in general, the neural-network-specific syntax of Flux is very helpful for building neural network architectures, or for retrieving default architectures. I don't think one would enjoy building a deep neural network by manually stitching together layers of GLM...

@datnamer commented Mar 18, 2019

Makes sense to me theoretically. Implementation feasibility aside, this was the insight behind the original Julia ML http://www.breloff.com/transformations/ where nodes/layers would be any transformation, including traditional learning algorithms.

Edit: On the other hand, Flux isn't supposed to be just a bunch of layers... with the whole differentiable programming paradigm, it seems like MLJ is actually a subset of what can be expressed with Flux. https://fluxml.ai/2019/02/07/what-is-differentiable-programming.html

@fkiraly (Collaborator) commented Mar 18, 2019

@datnamer - why do you think MLJ is a subset of Flux? I don't think the expressibility of one, in terms of modelling, is currently a subset of the other.

Of course both are Turing complete since you can write Julia in them, but I assume you mean this at the level of interface or composite construction?

Regarding transformations: yes, I think this is the right idea.
Though not every algorithm is an instance of "fitting the parameters by (regularised) gradient descent" - which seems to be a common misunderstanding of the deep learning age?

@fkiraly (Collaborator) commented Mar 18, 2019

On a side note, did Breloff leave any design documents behind for transformations? Or, is there a paper? And, is he still actively developing?

@ablaom (Collaborator) commented Mar 18, 2019

@fkiraly

I don't think one would enjoy building a deep neural network by manually stitching together layers of GLM...

No, no. We don't duplicate the Flux syntax. You define a Flux chain using their nice syntax. Then you have a standard wrapper for such objects, allowing you to slot them in as component models in an MLJ "learning network" (which might include non-NN components). Only the training of the NN gets externalised to MLJ (by declaring the neural net weights as model hyperparameters, instead of regarding them as parameters to be learned by calling fit), not its specification.

So syntax might look something like:

transformer = FluxWrapper(chain=Dense(100, 10, Flux.σ)) # dimension reducer
regressor = LightGBM(alpha=0.1) # a non-neural network model
composite = @pipeline transformer regressor

If you wrap composite in an SGD tuning strategy, and specify transformer.chain and regressor.alpha as the hyperparameters to be tuned, then fitting the tuned-model wrap will simultaneously train the weights of the NN and tune the regularisation parameter alpha of the regressor (assuming LightGBM is written in Julia, and so tunable by SGD). For hyperparameters that cannot be tuned by SGD (e.g., regressor.max_depth), you do separate tuning wraps (with different tuning strategies, such as grid search). As at present, these wraps could either be local (for tuning models individually) or global (to tune parameters in multiple component models simultaneously). Also, if we want a Flux model to present more "conventionally", we could wrap it locally in an SGD tuning strategy, but there are times you might not want to do this.
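
Continuing the sketch above, the tuning wrap might look something like this (TunedModel, GradientDescent and the params keyword are placeholders here - none of this machinery exists yet):

# Purely speculative sketch of an SGD tuning wrap:
tuned_composite = TunedModel(model=composite,
                             tuning=GradientDescent(),      # hypothetical SGD strategy
                             params=[:(transformer.chain),  # nn weights, as hyperparameters
                                     :(regressor.alpha)])   # regularisation parameter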

@fkiraly (Collaborator) commented Mar 18, 2019

Ah, I didn't think of that! You have a tuning wrapper which can generically tune SGD-fittable parameters and would fully interface with a composite model within? That would be genius (if it works).

Also very interesting, since it looks like an instance of separating a "fitting strategy" from the "model specification" at the level of composites - though I don't quite see how that could/should look generically.

@fkiraly (Collaborator) commented Mar 18, 2019

Regarding the last bit, I remember an instance of this: Bayesians occasionally do it within the "probabilistic programming" paradigm. Though, of course, that only supports Bayesian-style fitting...

@ablaom (Collaborator) commented Mar 19, 2019

You have a tuning wrapper which can generically tune SGD-fittable parameters

Well, we don't have it yet, but getting this working has been, I understand, a goal that predates my involvement in MLJ. I think @tlienart already had a go at this for some restricted class of models. Flux already provides an AD module we can use. The problem I see is that the parameters to be tuned (i.e., those with respect to which we want to differentiate) must be wrapped as "tracked" arrays, and I don't see how we can do this from outside the model (whose hyperparameter types are fixed). However, I understand @MikeInnes has been working on changes to the AD engine that might make this easier. Perhaps he can clarify.

So, yes, this is still somewhat speculative. The main point I want to raise is that we should not be in a hurry to integrate neural networks in the naive way (fitted in isolation), if there is a more elegant solution around the corner.

@fkiraly (Collaborator) commented Mar 19, 2019

@ablaom yes, by "you have" I meant, of course, "in the context of the plan/design which is not yet implemented".

Though I disagree with the conclusion: as you say, the full "learning network" design is somewhat speculative, and would be one of a kind (so who knows whether in the end it's brilliant or just a curiosity). Whereas integration-by-interface, e.g., a simple wrapper for a Flux architecture specification, is not a lot of work, and the design is obvious.

Rephrasing: I wouldn't avoid integrating an important model class entirely just because there's highly interesting (but risky) research to be done on it - especially since the plain integration seems to be a quick (but somewhat boring) job? Maybe there are volunteers...

@datnamer commented Mar 19, 2019

@fkiraly :

Of course both are Turing complete since you can write Julia in them, but I assume you mean this at the level of interface or composite construction?

Yes, that's what I meant. Flux models can soon be just functions, as I think even the need for layers might be going away: https://github.com/FluxML/model-zoo/blob/notebooks/other/flux-next/intro.ipynb (@ablaom, you can take a look at that notebook to see where Flux is going. The boilerplate for models is going to be reduced, and tracker types will no longer be needed once Zygote.jl is ready for prime time. More info on the AD and compiler stuff here: https://drive.google.com/file/d/1tK4n3qQ5YsJkLc-8FEw5JMa90gHHfh3i/edit)

On a side note, did Breloff leave any design documents behind for transformations? Or, is there a paper? And, is he still actively developing?

You can find some discussion here: https://github.com/JuliaML/Roadmap.jl/issues and on the blog I linked. @Evizero might know more. I don't think @tbreloff is still working on Julia open source.

@fkiraly (Collaborator) commented Mar 19, 2019

@datnamer thanks - Roadmap.jl doesn't look like it has a full set of design or org documents; most issues read more like an eclectic feature wishlist? There are also partial designs which seem to focus on optimization-based machine learning methods (?). A number of the thoughts might be useful.

The fate of Roadmap.jl also makes me think, @ablaom - perhaps at some point we should write up the key design decisions for the benefit of future generations, just in case we all get run over by a bus or something.

ablaom added the design discussion (Discussing design issues) label and removed the help wanted (Extra attention is needed) label May 23, 2019
@ablaom (Collaborator) commented Jun 22, 2020

Registration of the new MLJFlux models with the MLJ model registry: JuliaRegistries/General#16728

julia> using MLJModels

julia> models("Flux")
4-element Array{NamedTuple{(:name, :package_name, :is_supervised, :docstring, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :is_pure_julia, :is_wrapper, :load_path, :package_license, :package_url, :package_uuid, :prediction_type, :supports_online, :supports_weights, :input_scitype, :target_scitype, :output_scitype),T} where T<:Tuple,1}:
 (name = ImageClassifier, package_name = MLJFlux, ... )                  
 (name = MultitargetNeuralNetworkRegressor, package_name = MLJFlux, ... )
 (name = NeuralNetworkClassifier, package_name = MLJFlux, ... )          
 (name = NeuralNetworkRegressor, package_name = MLJFlux, ... )          
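
Any of these can then be loaded in the usual way, e.g. with something like:

using MLJ
clf = @load NeuralNetworkClassifier pkg=MLJFlux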

ablaom closed this as completed Jun 22, 2020