
Pipeline with XGBoost doesn't seem to serialize properly #794

Closed

timsetsfire opened this issue May 25, 2021 · 3 comments

@timsetsfire commented May 25, 2021

Describe the bug
I'm just getting into MLJ and Julia, so please forgive me if this is obvious, but I can't find any detail on this anywhere.

If I include XGBoost in a pipeline, `MLJ.save` does not seem to serialize the pipeline properly. Please see the code below for examples to reproduce.

To Reproduce
Non-pipeline example
Running the following code generates two files in ./model1, namely `xgb.jlso` and `xgb.xgboost.model`:

import RDatasets
using MLJ
boston = RDatasets.dataset("MASS", "Boston"); # a DataFrame
y, X = unpack(boston, ==(:MedV), colname -> true);
XGBoostRegressor = @load XGBoostRegressor
xgb = XGBoostRegressor()
xgb_machine = machine(xgb, X, y)
train, test = partition(eachindex(y), 0.7, shuffle=true); # 70:30 split
fit!(xgb_machine, rows=train)
MLJ.save("model1/xgb.jlso", xgb_machine)

ml2 = machine("model1/xgb.jlso")
predict(ml2, X)

Pipeline example
Running the following code generates only one file in ./model2, namely `xgb_pipeline.jlso`:

import RDatasets
using MLJ
boston = RDatasets.dataset("MASS", "Boston"); # a DataFrame
y, X = unpack(boston, ==(:MedV), colname -> true);
XGBoostRegressor = @load XGBoostRegressor
xgb = XGBoostRegressor()
pipe = @pipeline(xgb)
xgb_machine = machine(pipe, X, y)
train, test = partition(eachindex(y), 0.7, shuffle=true); # 70:30 split
fit!(xgb_machine, rows=train)
MLJ.save("model2/xgb_pipeline.jlso", xgb_machine)

ml2 = machine("model2/xgb_pipeline.jlso")
predict(ml2, X)

Expected behavior
I expect both of the above code snippets to behave identically. The first (non-pipeline) example returns predictions; the second returns the following error:

```
┌ Error: Failed to apply the operation predict to the machine Machine{XGBoostRegressor,…} @707, which receives it's data arguments from one or more nodes in a learning network. Possibly, one of these nodes is delivering data that is incompatible with the machine's model.
│ Model (XGBoostRegressor @439):
│ input_scitype = Table{var"#s45"} where var"#s45"<:(AbstractArray{var"#s13",1} where var"#s13"<:Continuous)
│ target_scitype = AbstractArray{Continuous,1}
│ output_scitype = Unknown
│
│ Incoming data:
│ arg of predict    scitype
│ -------------------------------------------
│ Source @676       Table{Union{AbstractArray{Continuous,1}, AbstractArray{Count,1}}}
│
│ Learning network sources:
│ source            scitype
│ -------------------------------------------
│ Source @676       Nothing
│ Source @800       Nothing
└ @ MLJBase /Users/timothy.whittaker/.julia/packages/MLJBase/00RAT/src/composition/learning_networks/nodes.jl:126
Call to XGBoost C function XGBoosterPredict failed: [18:06:35] /workspace/srcdir/xgboost/src/c_api/c_api.cc:498: DMatrix/Booster has not been intialized or has already been disposed.

Stacktrace:
 [1] _apply(::Tuple{Node{Machine{MLJXGBoostInterface.XGBoostRegressor,true}},Machine{MLJXGBoostInterface.XGBoostRegressor,true}}, ::DataFrames.DataFrame; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /Users/timothy.whittaker/.julia/packages/MLJBase/00RAT/src/composition/learning_networks/nodes.jl:132
 [2] _apply at /Users/timothy.whittaker/.julia/packages/MLJBase/00RAT/src/composition/learning_networks/nodes.jl:118 [inlined]
 [3] Node at /Users/timothy.whittaker/.julia/packages/MLJBase/00RAT/src/composition/learning_networks/nodes.jl:113 [inlined]
 [4] predict(::Pipeline254, ::NamedTuple{(:predict,),Tuple{Node{Machine{MLJXGBoostInterface.XGBoostRegressor,true}}}}, ::DataFrames.DataFrame) at /Users/timothy.whittaker/.julia/packages/MLJBase/00RAT/src/operations.jl:114
 [5] predict(::Machine{Pipeline254,true}, ::DataFrames.DataFrame) at /Users/timothy.whittaker/.julia/packages/MLJBase/00RAT/src/operations.jl:83
 [6] top-level scope at In[6]:2
 [7] include_string(::Function, ::Module, ::String, ::String) at ./loading.jl:1091
```

**Versions**
<details>
Julia Version 1.5.4
Commit 69fcb5745b (2021-03-11 19:13 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)
Environment:
  JULIA_HOME = /Applications/Julia-1.5.app/Contents/Resources/julia

Nothing prints when running `using MLJ`, but here are the details of my project:
BSON = "fbb218c0-5317-5bc6-957e-2ee96dd4b1f0"
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Flux = "587475ba-b771-5e3f-ad9e-33799f191a9c"
IJulia = "7073ff75-c697-5162-941a-fcdaad2a7d2a"
MLJ = "add582a8-e3ab-11e8-2d5e-e98b27df1bc7"
MLJDecisionTreeInterface = "c6f25543-311c-4c74-83dc-3ea6d1015661"
MLJModels = "d491faf4-2d78-11e9-2867-c94bc002c0b7"
MLJXGBoostInterface = "54119dfa-1dab-4055-a167-80440f4f7a91"
NearestNeighborModels = "636a865e-7cf4-491e-846c-de09b730eb36"
Pandas = "eadc2687-ae89-51f9-a5d9-86b5a6373a9c"
PyCall = "438e738f-606a-5dbb-bf0a-cddfbfd45ab0"
RDatasets = "ce6b1742-4840-55fa-b093-852dadbb1d8b"
Revise = "295af30f-e4ad-537b-8983-00126c2a3abe"
XGBoost = "009559a3-9522-5dbb-924b-0b6ed2b22bb9"
...

</details>

@ablaom (Member) commented May 25, 2021

Thanks for reporting. This is a known issue. For models that are not pure Julia (e.g., ones that wrap C code, as XGBoost does), a custom serialization method needs to be implemented for the model. While this has been done for XGBoost, so that stand-alone XGBoost models (machines) can be serialised, it does not currently extend to the case where the model sits inside any kind of composite model.

There is an open issue to fix this in special cases, for example for the TunedModel wrapper (#678), but the general problem will take some time to be addressed, as it is a little complicated.

One workaround would be to switch to a pure-Julia tree-boosting algorithm. I recommend the EvoTrees.jl models, which in MLJ are called EvoTreeRegressor, EvoTreeClassifier, EvoTreeCount and EvoTreeGaussian; a rough sketch of this swap is shown below. The package is actively maintained, and the authors seem quite open to feature requests if something you would like is missing.
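
For concreteness, here is a minimal sketch of that workaround applied to the pipeline example above, with EvoTreeRegressor swapped in for XGBoostRegressor. It assumes EvoTrees.jl has been added to the environment; the `pkg=EvoTrees` qualifier and the `model3/...` save path are illustrative choices, not part of the original report:

```
import RDatasets
using MLJ

boston = RDatasets.dataset("MASS", "Boston");
y, X = unpack(boston, ==(:MedV), colname -> true);

# Pure-Julia gradient-boosted trees, so the generic .jlso serialization
# has no external C handles or booster files to lose.
EvoTreeRegressor = @load EvoTreeRegressor pkg=EvoTrees
pipe = @pipeline(EvoTreeRegressor())

mach = machine(pipe, X, y)
train, test = partition(eachindex(y), 0.7, shuffle=true);
fit!(mach, rows=train)

MLJ.save("model3/evotree_pipeline.jlso", mach)   # directory assumed to exist
mach2 = machine("model3/evotree_pipeline.jlso")
predict(mach2, X)
```

Since no C objects are involved, the restored pipeline machine should predict without the `XGBoosterPredict` failure shown above.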

cc @jemrobinson

@timsetsfire (Author)

Thank you!! I will dig into EvoTree for sure. Have a great week!

@ablaom (Member) commented May 26, 2021

Closing as tracked by #678
