[python-package] cut unnecessary files out of sdist package #6560

Closed
jameslamb opened this issue Dec 30, 2020 · 1 comment · Fixed by #6565

Comments

@jameslamb
Contributor

Hello from Chicago 👋

Recently in LightGBM, I've been working on reducing the size of our Python package's source distribution. I found that many extra files from git submodules were being bundled in the package. This can be problematic in storage-sensitive environments. For example, the first time I tried to use lightgbm + pandas + scikit-learn together on AWS Lambda, I had to do some surgery to trim out unnecessary files to avoid hitting the 250 MB limit for extra packages (see the description of microsoft/LightGBM#3579 if you're curious).

Cutting the package size could also help PyPI's data transfer costs a little bit 😀

I cut the size of lightgbm's sdist package by making the rules in MANIFEST.in more specific, to target only the files that were needed. You can see the diffs for the PRs below:

I can see that there are some files in xgboost that are not necessary. For example, all of the dmlc-core unit test code and even dmlc-core's .git/ directory are currently bundled in the package produced by python setup.py sdist.

How I'm checking the contents of the package:
# with a clone of the repo
git submodule update --init --recursive
cd python-package
python setup.py sdist
open xgboost.egg-info/SOURCES.txt

# or from PyPI
wget https://files.pythonhosted.org/packages/8e/cd/c1c48514cdd03d735d38d2de471474eb7adc53fc5278cb4a877a25a29976/xgboost-1.3.1.tar.gz -O xgboost.tar.gz

tar -xvf xgboost.tar.gz
open xgboost-1.3.1/xgboost.egg-info/SOURCES.txt
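
The same check can also be done programmatically. The following is only a rough sketch, not something taken from the xgboost or LightGBM repos: the function names `summarize_sdist` and `paths_with_component` are hypothetical, and it assumes the sdist is a gzip-compressed tarball like the one downloaded above. It uses Python's standard-library `tarfile` module to total the bytes under each leading directory and to flag paths that contain a given component such as `.git`.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_sdist(fileobj, depth=2):
    """Return a Counter mapping leading path prefixes to total file bytes.

    `fileobj` is a binary file object for a .tar.gz sdist.
    `depth` is how many leading path components form each prefix
    (illustrative default: sdist root directory plus one level).
    """
    sizes = Counter()
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            parts = PurePosixPath(member.name).parts
            sizes["/".join(parts[:depth])] += member.size
    return sizes


def paths_with_component(names, component=".git"):
    """Return the paths in `names` that have `component` as a path segment."""
    return [name for name in names if component in PurePosixPath(name).parts]
```

For example, opening `xgboost.tar.gz` in binary mode, passing the file object to `summarize_sdist`, and calling `.most_common(10)` on the result would list the ten largest prefixes. Passing the lines of `SOURCES.txt` to `paths_with_component` would show whether anything under a `.git` directory made it into the package.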

I'd be happy to do this same work for the xgboost Python package, making the MANIFEST.in rules more specific to trim out unnecessary files. Would you consider a PR that did something similar?

Thanks for your time and consideration.

@hcho3
Collaborator

hcho3 commented Dec 30, 2020

@jameslamb Yes, it would be nice to remove unnecessary files from the source distribution. I will make sure to review your pull request.
