[R] Remove stringi dependency #5905

Closed
vspinu opened this issue Jul 17, 2020 · 12 comments

Comments

@vspinu
Contributor

vspinu commented Jul 17, 2020

I am building a predictor for AWS Lambda, where the hard limit on all unpacked dependencies is 250MB. xgboost plus its dependencies already takes 92MB, mostly due to the 56MB of stringi.

I have looked at the code base briefly, and all of the uses of stringi seem to be easily replaceable with base R functionality.

Would you be ok with that? I can have a look into a PR.
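
For readers who want to reproduce this kind of size breakdown, here is a minimal sketch. It assumes a standard R library layout (one subdirectory per installed package); the function name `pkg_sizes` is made up for illustration and is not part of R or xgboost.

```shell
#!/bin/sh
# List the packages in an R library directory by on-disk size, largest first.
# Assumption: "$1" is an R library directory with one subdirectory per package.
# pkg_sizes is a hypothetical helper name, not an existing tool.
pkg_sizes() {
    lib="$1"
    # du -sk prints "<kilobytes><TAB><path>" on both GNU and BSD systems.
    du -sk "$lib"/*/ 2>/dev/null \
        | sort -rn \
        | awk -F'\t' '{ printf "%.1f MB\t%s\n", $1 / 1024, $2 }'
}
```

It could be called as `pkg_sizes "$(Rscript -e 'cat(.libPaths()[1])')"` to inspect the first library on the search path.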

@trivialfis
Member

trivialfis commented Jul 18, 2020

Sure. But are you compiling everything from source? If so, did you strip out the debug symbols in stringi and xgboost?

@trivialfis
Member

trivialfis commented Jul 18, 2020

I noticed that for a CPU-only build with gcc-9, the XGBoost shared object is only 5.5MB without debug symbols:

cmake -DCMAKE_BUILD_TYPE=Release -DUSE_OPENMP=ON

@vspinu
Contributor Author

vspinu commented Jul 18, 2020

I am installing from the CRAN archive. This is how it looks:

g++ -std=gnu++11 -I"/home/vspinu/bin/R-4.0.0-bin/include" -DNDEBUG -I./include -I./dmlc-core/include -I./rabit/include -I. -DXGBOOST_STRICT_R_MODE=1 -DDMLC_LOG_BEFORE_THROW=0 -DDMLC_ENABLE_STD_THREAD=1 -DDMLC_DISABLE_STDIN=1 -DDMLC_LOG_CUSTOMIZE=1 -DXGBOOST_CUSTOMIZE_LOGGER=1 -DRABIT_CUSTOMIZE_MSG_ -DRABIT_STRICT_CXX98_  -I/usr/local/include  -fopenmp -DDMLC_CMAKE_LITTLE_ENDIAN=1 -pthread -fpic  -g -O2  -c xgboost_R.cc -o xgboost_R.o
g++ -std=gnu++11 -I"/home/vspinu/bin/R-4.0.0-bin/include" -DNDEBUG -I./include -I./dmlc-core/include -I./rabit/include -I. -DXGBOOST_STRICT_R_MODE=1 -DDMLC_LOG_BEFORE_THROW=0 -DDMLC_ENABLE_STD_THREAD=1 -DDMLC_DISABLE_STDIN=1 -DDMLC_LOG_CUSTOMIZE=1 -DXGBOOST_CUSTOMIZE_LOGGER=1 -DRABIT_CUSTOMIZE_MSG_ -DRABIT_STRICT_CXX98_  -I/usr/local/include  -fopenmp -DDMLC_CMAKE_LITTLE_ENDIAN=1 -pthread -fpic  -g -O2  -c xgboost_custom.cc -o xgboost_custom.o
gcc -I"/home/vspinu/bin/R-4.0.0-bin/include" -DNDEBUG -I./include -I./dmlc-core/include -I./rabit/include -I. -DXGBOOST_STRICT_R_MODE=1 -DDMLC_LOG_BEFORE_THROW=0 -DDMLC_ENABLE_STD_THREAD=1 -DDMLC_DISABLE_STDIN=1 -DDMLC_LOG_CUSTOMIZE=1 -DXGBOOST_CUSTOMIZE_LOGGER=1 -DRABIT_CUSTOMIZE_MSG_ -DRABIT_STRICT_CXX98_  -I/usr/local/include   -fpic  -g -O2  -c xgboost_assert.c -o xgboost_assert.o
gcc -I"/home/vspinu/bin/R-4.0.0-bin/include" -DNDEBUG -I./include -I./dmlc-core/include -I./rabit/include -I. -DXGBOOST_STRICT_R_MODE=1 -DDMLC_LOG_BEFORE_THROW=0 -DDMLC_ENABLE_STD_THREAD=1 -DDMLC_DISABLE_STDIN=1 -DDMLC_LOG_CUSTOMIZE=1 -DXGBOOST_CUSTOMIZE_LOGGER=1 -DRABIT_CUSTOMIZE_MSG_ -DRABIT_STRICT_CXX98_  -I/usr/local/include   -fpic  -g -O2  -c init.c -o init.o
g++ -std=gnu++11 -I"/home/vspinu/bin/R-4.0.0-bin/include" -DNDEBUG -I./include -I./dmlc-core/include -I./rabit/include -I. -DXGBOOST_STRICT_R_MODE=1 -DDMLC_LOG_BEFORE_THROW=0 -DDMLC_ENABLE_STD_THREAD=1 -DDMLC_DISABLE_STDIN=1 -DDMLC_LOG_CUSTOMIZE=1 -DXGBOOST_CUSTOMIZE_LOGGER=1 -DRABIT_CUSTOMIZE_MSG_ -DRABIT_STRICT_CXX98_  -I/usr/local/include  -fopenmp -DDMLC_CMAKE_LITTLE_ENDIAN=1 -pthread -fpic  -g -O2  -c amalgamation/xgboost-all0.cc -o amalgamation/xgboost-all0.o
In file included from ./dmlc-core/include/dmlc/././json.h:33:0,
                 from ./dmlc-core/include/dmlc/./parameter.h:26,
                 from ./dmlc-core/include/dmlc/registry.h:14,
                 from amalgamation/../src/metric/metric.cc:6,
                 from amalgamation/xgboost-all0.cc:13:

NDEBUG is there; should it be something else? I guess it's so big because of the inclusion of rabit and dmlc, no?

In any case, I am completely fine with a 30MB object for xgboost. The real issue is stringi, which is not a hard dependency; it is only used for plots and the tree parser.

@trivialfis
Member

It's the -g flag, which generates debug symbols.
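
If rebuilding is not convenient, the debug symbols can also be stripped from already installed shared objects. This is a rough sketch, assuming GNU binutils `strip` is available and that shared objects use the `.so` suffix (as on Linux); `strip_r_libs` and the `STRIP` override variable are hypothetical names introduced here, not an R or xgboost facility.

```shell
#!/bin/sh
# Strip debug symbols from every shared object under an R library directory.
# Assumptions: GNU binutils `strip` is on PATH and "$1" is an R library path.
# strip_r_libs and the STRIP override are hypothetical, for illustration only.
strip_r_libs() {
    lib="$1"
    # --strip-debug drops only debugging sections and keeps the symbol table
    # that the dynamic loader needs.
    find "$lib" -type f -name '*.so' -exec "${STRIP:-strip}" --strip-debug {} +
}
```

Setting `STRIP=echo` before the call prints the command that would run instead of modifying any files, which is a cheap way to preview the effect.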

@vspinu
Contributor Author

vspinu commented Jul 18, 2020

Thanks for this. I halved the size of the deployment by compiling with -g0 -O3. This also reduced the size of stringi to 33MB.
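
One way to apply such flags to every package installed from source is a user-level Makevars file, which R reads from `~/.R/Makevars` or from the path in `R_MAKEVARS_USER`. The sketch below is an assumption about how this could be set up, not necessarily what was done here; `write_makevars` is a hypothetical helper, and the target path is a parameter so that an existing file is not overwritten by accident.

```shell
#!/bin/sh
# Write a Makevars file that replaces R's default "-g -O2" compile flags.
# Assumption: the caller chooses the target path (e.g. "$HOME/.R/Makevars").
# write_makevars is a hypothetical helper name.
write_makevars() {
    target="$1"
    mkdir -p "$(dirname "$target")"
    # CXX11FLAGS is included because the log above compiles with -std=gnu++11.
    cat > "$target" <<'EOF'
CFLAGS = -g0 -O3
CXXFLAGS = -g0 -O3
CXX11FLAGS = -g0 -O3
EOF
}
```

After `write_makevars "$HOME/.R/Makevars"`, packages have to be reinstalled from source for the new flags to take effect.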

@trivialfis
Member

Awesome! Now, do you think it's worth removing stringi?

@trivialfis
Member

trivialfis commented Jul 19, 2020

Although for XGBoost I highly recommend using CMake to build instead of changing flags yourself.

@vspinu
Contributor Author

vspinu commented Jul 19, 2020

I would still say it's worth removing stringi. It's used for non-critical stuff and it's low-hanging fruit anyhow. 33MB is not a big deal for sure, but given that the entire R runtime with 23 packages on board is only 49MB, it surely feels disproportionate.

> I highly recommend using CMake to build instead of changing flags yourself.

It's not an option for us. We are using renv (a pyenv counterpart for R) to snapshot the entire project, so dealing with a custom installation would be a major complication.

@trivialfis
Member

@vspinu Sure, a PR is welcome. ;-)

@trivialfis
Member

@vspinu Not rushing you, but is there any update?

@vspinu
Contributor Author

vspinu commented Aug 12, 2020

Not yet. Busy month at work. Added to my reminders. Will provide a PR within a week or two. Thanks for pinging.

@trivialfis
Member

No problem. Just following up. Have fun with it.

vspinu added a commit to vspinu/xgboost that referenced this issue Sep 10, 2020
@hcho3 hcho3 closed this as completed in 1453bee Sep 12, 2020