
How can I extract a fitresult that does not contain any of the original data? #186

Closed
DilumAluthge opened this issue Jul 20, 2019 · 3 comments

DilumAluthge commented Jul 20, 2019

Summary

After fitting an MLJ model, I would like to extract the fitresult and export it for making future predictions on new data.

Sometimes, the fitresult object contains some or all of the original training data. (See the Example section below for a working example in which the fitresult contains all of the original training features and labels.) However, this is not necessary for making future predictions. In order to make future predictions on new data, I only need the actual fitted parameters (for example, the coefficients and intercepts of a linear model).

Is there a way that I can obtain a fitresult that only contains the actual fitted parameters and does not contain any of the original data?
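
To make the goal concrete, here is the kind of thing I am after for a plain (non-composite) model. This is only a sketch: I am assuming here that fitted_params returns the learned parameters for this RidgeRegressor, and the toy data is arbitrary.

using MLJ, MLJModels
@load RidgeRegressor

X = MLJ.table(randn(100, 3))  # toy features (arbitrary)
y = randn(100)                # toy target

mach = machine(RidgeRegressor(lambda=0.1), X, y)
fit!(mach)

# The learned parameters alone -- with no training data attached -- are
# what I would like to export from the secure server:
params = fitted_params(mach)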

Motivation

We have a secure server on which confidential data are stored. Because of legal requirements, the data cannot leave the server in any form.

I train machine learning models on the secure server. Once I have trained a model, I export the trained model (i.e. the fitted parameters) out of the secure server and onto a separate non-secure server. The trained models are stored on the non-secure server and are later transferred to a variety of other environments for use in making future predictions on new data.

Because of the restrictions surrounding the confidential data, I cannot export any trained model out of the secure server if it contains any of the original data. Therefore, I need a way to obtain a fitresult that does not contain any of the original features and labels.

Example

This example is taken from the "Building a simple learning network" example here and here.

julia> using MLJ, MLJBase, MLJModels

julia> @load RidgeRegressor
import MLJModels ✔
import MultivariateStats ✔
import MLJModels.MultivariateStats_.RidgeRegressor ✔

julia> mutable struct WrappedRidge <: DeterministicNetwork
           ridge_model
       end

julia> WrappedRidge(; ridge_model=RidgeRegressor) = WrappedRidge(ridge_model)
WrappedRidge

julia> function MLJ.fit(model::WrappedRidge, X, y)
           Xs = source(X)
           ys = source(y)

           stand_model = Standardizer()
           stand = machine(stand_model, Xs)
           W = transform(stand, Xs)

           box_model = UnivariateBoxCoxTransformer()  # for making data look normally-distributed
           box = machine(box_model, ys)
           z = transform(box, ys)

           ridge_model = model.ridge_model ###
           ridge = machine(ridge_model, W, z)
           zhat = predict(ridge, W)

           yhat = inverse_transform(box, zhat)
           fit!(yhat, verbosity=0)

           return yhat
       end

julia> task = load_boston()
SupervisedTask @ 792


julia> @show task.X[1:10, :Crim]
task.X[1:10, :Crim] = [0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829, 0.14455, 0.21124, 0.17004]
10-element Array{Float64,1}:
 0.00632
 0.02731
 0.02729
 0.03237
 0.06905
 0.02985
 0.08829
 0.14455
 0.21124
 0.17004

julia> @show task.y[1:10]
task.y[1:10] = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9]
10-element Array{Float64,1}:
 24.0
 21.6
 34.7
 33.4
 36.2
 28.7
 22.9
 27.1
 16.5
 18.9

julia> wrapped_model = WrappedRidge(ridge_model=RidgeRegressor(lambda=0.1))
WrappedRidge(ridge_model = RidgeRegressor @ 389,) @ 117

julia> mach = machine(wrapped_model, task)
Machine{WrappedRidge} @ 991


julia> fit!(mach; rows = :)
[ Info: Training Machine{WrappedRidge} @ 991.
Machine{WrappedRidge} @ 991


julia> @show mach.fitresult.nodes[1].data[1:10, :Crim]
((mach.fitresult).nodes[1]).data[1:10, :Crim] = [0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829, 0.14455, 0.21124, 0.17004]
10-element Array{Float64,1}:
 0.00632
 0.02731
 0.02729
 0.03237
 0.06905
 0.02985
 0.08829
 0.14455
 0.21124
 0.17004

julia> @show mach.fitresult.nodes[3].data[1:10]
((mach.fitresult).nodes[3]).data[1:10] = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9]
10-element Array{Float64,1}:
 24.0
 21.6
 34.7
 33.4
 36.2
 28.7
 22.9
 27.1
 16.5
 18.9

julia> using Test

julia> @test task.X == mach.fitresult.nodes[1].data
Test Passed

julia> @test task.y == mach.fitresult.nodes[3].data
Test Passed

ablaom commented Jul 22, 2019

Thanks for bringing up this very interesting example. Regarding your queries:

After fitting an MLJ model, I would like to extract the fitresult and export it for making future predictions on new data.

Loading and saving machines is on the to-do list (#138). When you save a machine, the data fields (args) will be discarded; only the model and the fitresult are saved.
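
In the meantime, for an ordinary (non-network) model you could hand-roll this with the Serialization standard library: persist just the model and the fitresult and drop the rest of the machine. A rough sketch only (not MLJ API, and it does not help when the fitresult itself wraps the data, as in your network example; Xnew stands for your new data):

using Serialization

# On the secure server: keep only hyperparameters and learned parameters.
serialize("trained.jls", (model=mach.model, fitresult=mach.fitresult))

# Elsewhere: reload and predict via the low-level model API.
nt = deserialize("trained.jls")
yhat = predict(nt.model, nt.fitresult, Xnew)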

Sometimes, the fitresult object contains some or all of the original training data. (See the Example section below for a working example in which the fitresult contains all of the original training features and labels.)

Naturally, as the fitresult is a product of the data, one cannot remove all traces of the data from it. In the extreme case (some KNN models, for example) the fitresult is the data.

As I understand it, in the health sciences the approach to this problem is to anonymise the data before applying machine learning to it, or to put some kind of anonymising facade in front of the data. Is this possible in your case?

That said, I was surprised to see that all of the data remains accessible in the composite example you construct, which I had not realised. This is a concern, if only from the point of view of memory resources, and needs investigating. The data you are accessing is not actually necessary for prediction (it is retained only to save computation when the network is repeatedly retrained, as in tuning). Perhaps we should have a purge method, automatically executed when machines are saved.
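
Roughly what I have in mind, as a purely hypothetical sketch (no such method exists yet; it assumes source nodes are mutable and that sources collects a network's source nodes):

# Hypothetical purge method: strip training data from the source nodes of
# a fitted learning network, leaving all learned parameters intact.
function purge!(yhat::Node)
    for s in sources(yhat)
        s.data = nothing  # drop the data bound at this source
    end
    return yhat
end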

@ablaom
Copy link
Member

ablaom commented Aug 1, 2019

Okay, the issue of data appearing in the fitresult associated with models coming from learning networks has been resolved as follows:

  1. If the model is obtained using the new (undocumented) @from_network macro (see Example 1 below), then "data anonymization" is automatically taken care of: the fit method now shifts the data at the source nodes of the reconstructed network out to the cache, and the update method temporarily shifts it back to perform updates.

  2. If the model is obtained from a network built "with your bare hands", as in the issue poster's example, then the problem remains, but there is a workaround using a new method, anonymize!, as shown in Example 2 below. Unfortunately, the "simple" fit method must be replaced by the full version, meaning verbosity must be given as an argument and the method must return a triple (fitresult, cache, report) instead of just the fitresult (the final node of the network).

Example 1

using MLJ, CSV
X, y = load_boston()();

@load RidgeRegressor

Xs = source(X) # or source(nothing)
ys = source(y) # or source(nothing)

stand_model = Standardizer()
stand = machine(stand_model, Xs)
W = transform(stand, Xs)

box_model = UnivariateBoxCoxTransformer()
box = machine(box_model, ys)
z = transform(box, ys)

ridge_model = RidgeRegressor()
ridge = machine(ridge_model, W, z)
zhat = predict(ridge, W)

yhat = inverse_transform(box, zhat)

wrapped_model = @from_network WrappedRidge(ridge_model=ridge_model) <= (Xs, ys, yhat)

mach = machine(wrapped_model, X, y)
fit!(mach; rows = :)
@show mach.fitresult.nodes[1].data

((mach.fitresult).nodes[1]).data = nothing    # <---------- no data here !

Example 2

using MLJ

@load RidgeRegressor

mutable struct WrappedRidge <: DeterministicNetwork
    ridge_model
end

WrappedRidge(; ridge_model=RidgeRegressor) = WrappedRidge(ridge_model)

function MLJ.fit(model::WrappedRidge, verbosity::Integer, X, y)
    Xs = source(X)
    ys = source(y)
    
    stand_model = Standardizer()
    stand = machine(stand_model, Xs)
    W = transform(stand, Xs)
    
    box_model = UnivariateBoxCoxTransformer()  # for making data look normally-distributed
    box = machine(box_model, ys)
    z = transform(box, ys)
    
    ridge_model = model.ridge_model ###
    ridge = machine(ridge_model, W, z)
    zhat = predict(ridge, W)
    
    yhat = inverse_transform(box, zhat)
    fit!(yhat, verbosity=0)

    cache = anonymize!(Xs, ys)
    
    return yhat, cache, nothing
end

using CSV 
task = load_boston()
wrapped_model = WrappedRidge(ridge_model=RidgeRegressor(lambda=0.1))
mach = machine(wrapped_model, task)
fit!(mach; rows = :)
@show mach.fitresult.nodes[1].data

((mach.fitresult).nodes[1]).data = nothing  # <----- no data here!
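
Prediction on new data is unaffected by the anonymization, since predict rebinds the sources to the new data. A quick check (here Xnew just stands in for genuinely new data):

Xnew = selectrows(task.X, 1:3)  # stand-in for genuinely new data
predict(mach, Xnew)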

ablaom closed this as completed Aug 1, 2019

ablaom commented Aug 14, 2019

Update: The new API for exporting learning networks (which automatically anonymises the fitresult) is here.
