How can I extract a fitresult that does not contain any of the original data? #186
Thanks for bringing up this very interesting example. Regarding your queries:
Loading and saving machines is on the to-do list #138. When you save a machine the data fields (args) will be dumped; only the model and fitresult are saved.
Naturally, as the fitresult is a product of the data, one cannot remove all traces of the data from the fitresult. In some extreme cases (some KNN models) the fitresult is the data. As I understand it, in the health sciences, the approach to this problem is to anonymise the data before applying machine learning to it, or to have some kind of anonymising facade to the data. Is this possible in your case? That said, I was surprised to see that all of the data remains accessible in the composite example you construct, which I had not realised. This is a concern, if only from the point of view of memory resources, and needs investigating. The data you are accessing is not actually necessary for predicting (but is used to save computations when repeatedly training the network, as in tuning). Perhaps we should have a
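To make the distinction above concrete, here is a minimal sketch in plain Julia (no MLJ) contrasting a fit that keeps only parameters with one that keeps the training data. The types `RidgeFit` and `NearestNeighbourFit` and the functions `fit_ridge`, `fit_nn` and `predict` are hypothetical names invented for this illustration; they are not MLJ's fitresult types.

```julia
using LinearAlgebra  # stdlib: provides I, dot and the backslash solver

# Hypothetical toy "fitresult" types, for illustration only (not MLJ types).
struct RidgeFit
    coefs::Vector{Float64}   # one coefficient per feature
    intercept::Float64
end

struct NearestNeighbourFit
    X::Matrix{Float64}       # the training features themselves
    y::Vector{Float64}       # the training targets themselves
end

# Closed-form ridge regression on centred data; only parameters are kept.
function fit_ridge(X::Matrix{Float64}, y::Vector{Float64}; lambda=0.1)
    xbar = vec(sum(X, dims=1)) ./ size(X, 1)
    ybar = sum(y) / length(y)
    Xc = X .- xbar'
    coefs = (Xc' * Xc + lambda * I) \ (Xc' * (y .- ybar))
    return RidgeFit(coefs, ybar - dot(coefs, xbar))
end

predict(f::RidgeFit, Xnew::Matrix{Float64}) = Xnew * f.coefs .+ f.intercept

# "Fitting" a 1-nearest-neighbour model just memorises the data.
fit_nn(X::Matrix{Float64}, y::Vector{Float64}) = NearestNeighbourFit(copy(X), copy(y))

function predict(f::NearestNeighbourFit, Xnew::Matrix{Float64})
    map(1:size(Xnew, 1)) do i
        dists = [sum(abs2, f.X[j, :] .- Xnew[i, :]) for j in 1:size(f.X, 1)]
        f.y[argmin(dists)]
    end
end

X = [1.0 2.0; 2.0 1.0; 3.0 4.0; 4.0 3.0; 5.0 6.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0]

ridge_fit = fit_ridge(X, y)
nn_fit = fit_nn(X, y)

# The ridge fit holds one number per feature plus an intercept ...
@show length(ridge_fit.coefs)
# ... whereas the nearest-neighbour fit is a full copy of the data.
@show nn_fit.X == X
```

In the first case the exported object cannot be used to read back individual training rows directly; in the second, exporting the fit is the same as exporting the data, so no amount of post-processing of the fitresult would help.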
Okay, the issue of data appearing in the fitresult associated with models coming from learning networks has been resolved as follows:
Example 1
using MLJ # assumed import; needed for source, machine, @load, etc.
using CSV
X, y = load_boston()();
@load RidgeRegressor
Xs = source(X) # or source(nothing)
ys = source(y) # or source(nothing)
stand_model = Standardizer()
stand = machine(stand_model, Xs)
W = transform(stand, Xs)
box_model = UnivariateBoxCoxTransformer()
box = machine(box_model, ys)
z = transform(box, ys)
ridge_model = RidgeRegressor()
ridge = machine(ridge_model, W, z)
zhat = predict(ridge, W)
yhat = inverse_transform(box, zhat)
wrapped_model = @from_network WrappedRidge(ridge_model=ridge_model) <= (Xs, ys, yhat)
mach = machine(wrapped_model, X, y)
fit!(mach; rows = :)
@show mach.fitresult.nodes[1].data
((mach.fitresult).nodes[1]).data = nothing # <---------- no data here!
Example 2
using MLJ
@load RidgeRegressor
mutable struct WrappedRidge <: DeterministicNetwork
    ridge_model
end
# keyword constructor; the default must be a model instance, not the type
WrappedRidge(; ridge_model=RidgeRegressor()) = WrappedRidge(ridge_model)
function MLJ.fit(model::WrappedRidge, verbosity::Integer, X, y)
    # source nodes wrapping the training data
    Xs = source(X)
    ys = source(y)
    stand_model = Standardizer()
    stand = machine(stand_model, Xs)
    W = transform(stand, Xs)
    box_model = UnivariateBoxCoxTransformer() # for making data look normally-distributed
    box = machine(box_model, ys)
    z = transform(box, ys)
    ridge_model = model.ridge_model # use the model supplied in the wrapper
    ridge = machine(ridge_model, W, z)
    zhat = predict(ridge, W)
    yhat = inverse_transform(box, zhat)
    fit!(yhat, verbosity=0)
    # empty the source nodes so the returned fitresult carries no data
    cache = anonymize!(Xs, ys)
    return yhat, cache, nothing
end
using CSV
task = load_boston()
wrapped_model = WrappedRidge(ridge_model=RidgeRegressor(lambda=0.1))
mach = machine(wrapped_model, task)
fit!(mach; rows = :)
@show mach.fitresult.nodes[1].data
((mach.fitresult).nodes[1]).data = nothing # <----- no data here!
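In Example 2, the call `cache = anonymize!(Xs, ys)` is what leaves the source nodes empty, and the returned cache presumably keeps the removed data so that retraining remains possible. The following is a toy sketch of that idea in plain Julia. `ToySource`, `toy_anonymize!` and `toy_restore!` are hypothetical names made up for this illustration and are not MLJ's implementation.

```julia
# Hypothetical stand-in for a learning-network source node (not MLJ's type).
mutable struct ToySource
    data::Any
end

# Empty each source and return the removed data so it can be restored later.
function toy_anonymize!(sources::ToySource...)
    removed = map(s -> s.data, sources)      # tuple of the original data
    foreach(s -> s.data = nothing, sources)  # the sources now hold nothing
    return (sources=sources, data=removed)
end

# Put the data back into the sources, e.g. before retraining.
function toy_restore!(cache)
    foreach((s, d) -> s.data = d, cache.sources, cache.data)
    return nothing
end

Xs = ToySource([1.0 2.0; 3.0 4.0])
ys = ToySource([0.5, 1.5])

cache = toy_anonymize!(Xs, ys)
@show Xs.data ys.data   # both sources are empty after anonymising

toy_restore!(cache)
@show Xs.data           # the original matrix is back
```

The design point is that the object used for prediction (the analogue of the fitresult) and the object holding the data (the cache) are kept separate, so only the former needs to leave the secure environment.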
Update: The new API for exporting learning networks (which automatically anonymises the fitresult) is here
Summary
After fitting an MLJ model, I would like to extract the fitresult and export it for making future predictions on new data.
Sometimes, the fitresult object contains some or all of the original training data. (See the Example section below for a working example in which the fitresult contains all of the original training features and labels.) However, this is not necessary for making future predictions. In order to make future predictions on new data, I only need the actual fitted parameters (for example, the coefficients and intercepts of a linear model).
Is there a way that I can obtain a fitresult that only contains the actual fitted parameters and does not contain any of the original data?
Motivation
We have a secure server on which confidential data are stored. Because of legal requirements, the data cannot leave the server in any form.
I train machine learning models on the secure server. Once I have trained a model, I export the trained model (i.e. the fitted parameters) out of the secure server and onto a separate non-secure server. The trained models are stored on the non-secure server and are later transferred to a variety of other environments for use in making future predictions on new data.
Because of the restrictions surrounding the confidential data, I cannot export any trained model out of the secure server if it contains any of the original data. Therefore, I need a way to obtain a fitresult that does not contain any of the original features and labels.
Example
This example is taken from the "Building a simple learning network" example here and here.