
How can I extract a fitresult that does not contain any of the original data? #186

Closed
DilumAluthge opened this issue Jul 20, 2019 · 3 comments

DilumAluthge commented Jul 20, 2019

Summary

After fitting an MLJ model, I would like to extract the fitresult and export it for making future predictions on new data.

Sometimes, the fitresult object contains some or all of the original training data. (See the Example section below for a working example in which the fitresult contains all of the original training features and labels.) However, this is not necessary for making future predictions. In order to make future predictions on new data, I only need the actual fitted parameters (for example, the coefficients and intercepts of a linear model).

Is there a way that I can obtain a fitresult that only contains the actual fitted parameters and does not contain any of the original data?
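
To make the goal concrete, here is the kind of thing I am after for a plain (non-composite) model. This is only a sketch: I am assuming here that fitted_params returns the learned parameters for this RidgeRegressor, and the toy data is arbitrary.

using MLJ, MLJModels
@load RidgeRegressor

X = MLJ.table(randn(100, 3))  # toy features (arbitrary)
y = randn(100)                # toy target

mach = machine(RidgeRegressor(lambda=0.1), X, y)
fit!(mach)

# The learned parameters alone -- with no training data attached -- are
# what I would like to export from the secure server:
params = fitted_params(mach)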

Motivation

We have a secure server on which confidential data are stored. Because of legal requirements, the data cannot leave the server in any form.

I train machine learning models on the secure server. Once I have trained a model, I export the trained model (i.e. the fitted parameters) out of the secure server and onto a separate non-secure server. The trained models are stored on the non-secure server and are later transferred to a variety of other environments for use in making future predictions on new data.

Because of the restrictions surrounding the confidential data, I cannot export any trained model out of the secure server if it contains any of the original data. Therefore, I need a way to obtain a fitresult that does not contain any of the original features and labels.

Example

This example is taken from the "Building a simple learning network" example here and here.

julia> using MLJ, MLJBase, MLJModels

julia> @load RidgeRegressor
import MLJModels ✔
import MultivariateStats ✔
import MLJModels.MultivariateStats_.RidgeRegressor ✔

julia> mutable struct WrappedRidge <: DeterministicNetwork
           ridge_model
       end

julia> WrappedRidge(; ridge_model=RidgeRegressor) = WrappedRidge(ridge_model)
WrappedRidge

julia> function MLJ.fit(model::WrappedRidge, X, y)
           Xs = source(X)
           ys = source(y)

           stand_model = Standardizer()
           stand = machine(stand_model, Xs)
           W = transform(stand, Xs)

           box_model = UnivariateBoxCoxTransformer()  # for making data look normally-distributed
           box = machine(box_model, ys)
           z = transform(box, ys)

           ridge_model = model.ridge_model ###
           ridge = machine(ridge_model, W, z)
           zhat = predict(ridge, W)

           yhat = inverse_transform(box, zhat)
           fit!(yhat, verbosity=0)

           return yhat
       end

julia> task = load_boston()
SupervisedTask @ 792


julia> @show task.X[1:10, :Crim]
task.X[1:10, :Crim] = [0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829, 0.14455, 0.21124, 0.17004]
10-element Array{Float64,1}:
 0.00632
 0.02731
 0.02729
 0.03237
 0.06905
 0.02985
 0.08829
 0.14455
 0.21124
 0.17004

julia> @show task.y[1:10]
task.y[1:10] = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9]
10-element Array{Float64,1}:
 24.0
 21.6
 34.7
 33.4
 36.2
 28.7
 22.9
 27.1
 16.5
 18.9

julia> wrapped_model = WrappedRidge(ridge_model=RidgeRegressor(lambda=0.1))
WrappedRidge(ridge_model = RidgeRegressor @ 389,) @ 117

julia> mach = machine(wrapped_model, task)
Machine{WrappedRidge} @ 991


julia> fit!(mach; rows = :)
[ Info: Training Machine{WrappedRidge} @ 991.
Machine{WrappedRidge} @ 991


julia> @show mach.fitresult.nodes[1].data[1:10, :Crim]
((mach.fitresult).nodes[1]).data[1:10, :Crim] = [0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829, 0.14455, 0.21124, 0.17004]
10-element Array{Float64,1}:
 0.00632
 0.02731
 0.02729
 0.03237
 0.06905
 0.02985
 0.08829
 0.14455
 0.21124
 0.17004

julia> @show mach.fitresult.nodes[3].data[1:10]
((mach.fitresult).nodes[3]).data[1:10] = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9]
10-element Array{Float64,1}:
 24.0
 21.6
 34.7
 33.4
 36.2
 28.7
 22.9
 27.1
 16.5
 18.9

julia> using Test

julia> @test task.X == mach.fitresult.nodes[1].data
Test Passed

julia> @test task.y == mach.fitresult.nodes[3].data
Test Passed

ablaom commented Jul 22, 2019

Thanks for bringing up this very interesting example. Regarding your queries:

After fitting an MLJ model, I would like to extract the fitresult and export it for making future predictions on new data.

Loading and saving machines is on the to-do list (#138). When you save a machine, the data fields (args) will be discarded; only the model and the fitresult are saved.
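
In the meantime, for an ordinary (non-network) model you could hand-roll this with the Serialization standard library: persist just the model and the fitresult and drop the rest of the machine. A rough sketch only (not MLJ API, and it does not help when the fitresult itself wraps the data, as in your network example; Xnew stands for your new data):

using Serialization

# On the secure server: keep only hyperparameters and learned parameters.
serialize("trained.jls", (model=mach.model, fitresult=mach.fitresult))

# Elsewhere: reload and predict via the low-level model API.
nt = deserialize("trained.jls")
yhat = predict(nt.model, nt.fitresult, Xnew)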

Sometimes, the fitresult object contains some or all of the original training data. (See the Example section below for a working example in which the fitresult contains all of the original training features and labels.)

Naturally, as the fitresult is a product of the data, one cannot remove all traces of the data from it. In the extreme case (some KNN models, for example) the fitresult is the data.

As I understand it, in the health sciences the approach to this problem is to anonymise the data before applying machine learning to it, or to put some kind of anonymising facade in front of the data. Is this possible in your case?

That said, I was surprised to see that all of the data remains accessible in the composite example you construct, which I had not realised. This is a concern, if only from the point of view of memory resources, and needs investigating. The data you are accessing is not actually necessary for prediction (it is retained only to save computation when the network is repeatedly retrained, as in tuning). Perhaps we should have a purge method, automatically executed when machines are saved.
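
Roughly what I have in mind, as a purely hypothetical sketch (no such method exists yet; it assumes source nodes are mutable and that sources collects a network's source nodes):

# Hypothetical purge method: strip training data from the source nodes of
# a fitted learning network, leaving all learned parameters intact.
function purge!(yhat::Node)
    for s in sources(yhat)
        s.data = nothing  # drop the data bound at this source
    end
    return yhat
end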

@ablaom
Copy link
Member

ablaom commented Aug 1, 2019

Okay, the issue of data appearing in the fitresult associated with models coming from learning networks has been resolved as follows:

  1. If the model is obtained using the new (undocumented) @from_network macro (see Example 1 below), then "data anonymization" is automatically taken care of: the fit method now shifts the data at the source nodes of the reconstructed network out to the cache, and the update method temporarily shifts it back to perform updates.

  2. If the model is obtained from a network built "with your bare hands", as in the issue poster's example, then the problem remains, but there is a workaround using a new method, anonymize!, as shown in Example 2 below. Unfortunately, the "simple" fit method must be replaced by the full version, meaning verbosity must be given as an argument and the method must return a triple (fitresult, cache, report) instead of just the fitresult (the final node of the network).

Example 1

using MLJ, CSV
X, y = load_boston()();

@load RidgeRegressor

Xs = source(X) # or source(nothing)
ys = source(y) # or source(nothing)

stand_model = Standardizer()
stand = machine(stand_model, Xs)
W = transform(stand, Xs)

box_model = UnivariateBoxCoxTransformer()
box = machine(box_model, ys)
z = transform(box, ys)

ridge_model = RidgeRegressor()
ridge = machine(ridge_model, W, z)
zhat = predict(ridge, W)

yhat = inverse_transform(box, zhat)

wrapped_model = @from_network WrappedRidge(ridge_model=ridge_model) <= (Xs, ys, yhat)

mach = machine(wrapped_model, X, y)
fit!(mach; rows = :)
@show mach.fitresult.nodes[1].data

((mach.fitresult).nodes[1]).data = nothing    # <---------- no data here !

Example 2

using MLJ

@load RidgeRegressor

mutable struct WrappedRidge <: DeterministicNetwork
    ridge_model
end

WrappedRidge(; ridge_model=RidgeRegressor) = WrappedRidge(ridge_model)

function MLJ.fit(model::WrappedRidge, verbosity::Integer, X, y)
    Xs = source(X)
    ys = source(y)
    
    stand_model = Standardizer()
    stand = machine(stand_model, Xs)
    W = transform(stand, Xs)
    
    box_model = UnivariateBoxCoxTransformer()  # for making data look normally-distributed
    box = machine(box_model, ys)
    z = transform(box, ys)
    
    ridge_model = model.ridge_model ###
    ridge = machine(ridge_model, W, z)
    zhat = predict(ridge, W)
    
    yhat = inverse_transform(box, zhat)
    fit!(yhat, verbosity=0)

    cache = anonymize!(Xs, ys)
    
    return yhat, cache, nothing
end

using CSV 
task = load_boston()
wrapped_model = WrappedRidge(ridge_model=RidgeRegressor(lambda=0.1))
mach = machine(wrapped_model, task)
fit!(mach; rows = :)
@show mach.fitresult.nodes[1].data

((mach.fitresult).nodes[1]).data = nothing  # <----- no data here!
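
Prediction on new data is unaffected by the anonymization, since predict rebinds the sources to the new data. A quick check (here Xnew just stands in for genuinely new data):

Xnew = selectrows(task.X, 1:3)  # stand-in for genuinely new data
predict(mach, Xnew)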

ablaom closed this as completed Aug 1, 2019

ablaom commented Aug 14, 2019

Update: The new API for exporting learning networks (which automatically anonymises the fitresult) is here.
