
Implement MLJ interface for linear models #35

Closed
2 of 7 tasks
ablaom opened this issue Dec 15, 2018 · 6 comments
Labels: enhancement (New feature or request), help wanted (Extra attention is needed)

ablaom (Member) commented Dec 15, 2018

Would someone like to implement the new MLJ interface for the linear models for which Julia code already exists? Candidate packages:

  • GLM.jl, for many of these
  • Lasso.jl, which needs upgrading from Julia 0.6
  • MultivariateStats.jl

GLM

Lasso.jl

  • lasso regression
  • fused lasso
  • trend filter
  • gamma lasso

MultivariateStats.jl

  • OLS
  • Ridge
Relevant:

darenasc added the help wanted label Dec 15, 2018
ablaom added the enhancement label Dec 15, 2018
ablaom (Member, Author) commented Dec 16, 2018

In response to an offer of help from @tlienart. Some details:

How about you put your implementation of the MLJ "model interface" for
GLM.jl in a module that lives in 'src/builtins/GLM.jl' (where we
currently have the toy "KNN.jl"), although your code will probably more
closely resemble the MultivariateStats.jl stub where I put the RidgeRegressor
model. (I think we will move away from lazily loaded interface
implementations; if your code does not stay in builtins, it might become a
separate package, or we might try to get GLM.jl to include the
interface in their code.)

I expect you will generally be predicting probabilities
rather than actual target values (this will probably be done in the
RidgeRegressor as well, but isn't at present). There has been some very
recent discussion about exactly what predict should return in these
cases; see

issue 34

and

issue 33

We will go with @fkiraly's recommendations, which are not reflected in the adding_new_models.md document just yet. In particular:

  • if an algorithm predicts probabilities, there is no need to implement
    a second predict method that predicts values (i.e., means, or values obtained
    by applying a threshold, etc.). So only one predict method per model. (We will
    dump predict_proba.)

  • the predict method will predict a vector of distribution objects,
    one for each input pattern; see the sketch after this list. (To get the
    probability of a specific outcome for the target, one will need to call the
    object on the outcome of interest, as Franz explains in the first thread
    above. However, your interface isn't concerned with this.) I admit I
    haven't thought too much about the details of this yet, but hopefully
    we can just use Distributions.jl for this purpose. I will be turning
    to this question first thing when I return from holiday in the new year.
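
To make this concrete, here is a minimal sketch of what such a prediction might look like, assuming Distributions.jl ends up supplying the distribution objects (the values and the exact querying calls shown are illustrative assumptions, not settled interface):

using Distributions

# what predict might return for two input patterns (illustrative values):
yhat = [Normal(0.5, 0.1), Normal(0.7, 0.1)]

# querying the first prediction at an outcome of interest:
pdf(yhat[1], 0.55)   # density at the outcome 0.55
cdf(yhat[1], 0.55)   # probability of an outcome ≤ 0.55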

Do keep in mind that, in the case of nominal target data, the target
y will arrive at your model as a CategoricalArray whose pool includes
levels that may or may not actually be realized in
the data, but which need to be incorporated in the distribution object
(with zero probability if they do not occur); see also the
adding_new_models.md doc, and the illustration below.
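
For illustration (the data here are made up), using CategoricalArrays.jl:

using CategoricalArrays

# "maybe" is in the pool but never occurs in the data:
y = categorical(["yes", "no", "yes"], levels=["yes", "no", "maybe"])
levels(y)   # ["yes", "no", "maybe"]

# a fitted classifier must still assign a probability (zero) to "maybe"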

Note that you will need a separate model for each kind of target data
/ response type, because each model SomeModel can only have one value for
metadata(SomeModel)[:outputs_are]. (To the possible values "nominal", "ordinal", "multiclass"
and "multivariate" we will now add "probabilistic", meaning
probabilities are to be predicted.) So you might have these models:

GLMProbabilisticRegressor
GLMProbabilisticClassifier
GLMProbabilisticMulticlassClassifier

and limit the allowed values of the "family" and "link" options
accordingly; see the sketch below. Perhaps don't worry about models for multivariate targets
just now.
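
For instance, a binary classifier wrapping GLM.jl would presumably restrict itself to a Bernoulli family with a suitable link. A rough illustration, calling GLM.jl directly rather than through the hypothetical wrappers above (the data are made up):

using GLM, Distributions

X = hcat(ones(10), rand(10))           # design matrix with an intercept column
y = rand(Bool, 10)                     # binary target, illustrative only
glm(X, y, Bernoulli(), LogitLink())    # the kind of call a GLMProbabilisticClassifier might make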

No need for an R-style "formula": your model already gets a separate input
X and target y, and you fit to all input features (columns) of
X. Feature selection will be external to the model interface. A schematic sketch follows.
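
Schematically (the model name follows the suggestion above; the signature and the use of GLM.lm are assumptions for illustration, not the actual interface specification):

using GLM

# hypothetical model struct
mutable struct GLMProbabilisticRegressor
    fit_intercept::Bool
end

# X arrives as a matrix of inputs, y as the target vector;
# every column of X is used, so no formula is needed
function fit(model::GLMProbabilisticRegressor, X::AbstractMatrix, y::AbstractVector)
    Xmat = model.fit_intercept ? hcat(ones(eltype(X), size(X, 1)), X) : X
    return GLM.lm(Xmat, y)
end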

ablaom closed this as completed Dec 16, 2018
ablaom reopened this Dec 16, 2018
ablaom (Member, Author) commented Dec 16, 2018

Oops. Closed by accident. :-)

tlienart (Collaborator) commented

OK, I'll start on this and will probably open a NO-MERGE PR for guidance while I get familiar with the interface and more comfortable with the goal.

This was referenced Jan 22, 2019
tlienart self-assigned this Jul 12, 2019
xiaodaigh (Contributor) commented Aug 16, 2019

Is there an example of how to use linear models with MLJ.jl? Can anyone please show me a simple example of fitting y = ax + b, where a and b are coefficients? E.g., in GLM it would be

using GLM, DataFrames
x = rand(100)
y = rand(100)
data = DataFrame(x=x, y=y)
lm(@formula(y ~ x), data)

tlienart (Collaborator) commented Aug 17, 2019

Hello @xiaodaigh, there's an ongoing PR to interface with the GLM models, which I would think should be merged next week.

ablaom (Member, Author) commented Aug 18, 2019

For now you can use OLS (ordinary least squares regressor) or RidgeRegressor. For example:

julia> using MLJ
julia> X = (x1=rand(100), x2=rand(100));  # input must be a Tables.jl compatible table
julia> y = rand(100);
julia> @load OLSRegressor          # load code from external packages

julia> model = OLSRegressor()      # instantiate model
OLSRegressor(fit_intercept = true,) @ 470

julia> mach = machine(model, X, y)  # bind model to train/evaluation data
Machine{OLSRegressor} @ 197

julia> fit!(mach, rows=1:95)    # fit on selected rows
[ Info: Training Machine{OLSRegressor} @ 197.
Machine{OLSRegressor} @ 197

julia> predict(mach, rows=96:100)  # get (probabilistic) predictions on some other rows
5-element Array{Distributions.Normal{Float64},1}:
 Distributions.Normal{Float64}(μ=0.5573871503802207, σ=0.2789162731813959)
 Distributions.Normal{Float64}(μ=0.5910371492542903, σ=0.2789162731813959)
 Distributions.Normal{Float64}(μ=0.4871839625605999, σ=0.2789162731813959)
 Distributions.Normal{Float64}(μ=0.6031116815100634, σ=0.2789162731813959)
 Distributions.Normal{Float64}(μ=0.5461718402936951, σ=0.2789162731813959)

julia> predict_mean(mach, rows=1:5)  # get point predictions
5-element Array{Float64,1}:
 0.5573871503802207
 0.5910371492542903
 0.4871839625605999
 0.6031116815100634
 0.5461718402936951

julia> predict_mean(mach, (x1=rand(4), x2=rand(4)))  # get point predictions on new input data
4-element Array{Float64,1}:
 0.5483367654825207
 0.5948051723537034
 0.4847273704563324
 0.5892571004039957
