[python-package] remove unnecessary files to reduce sdist size (fixes #6560) #6565

Merged: 1 commit into dmlc:master on Jan 2, 2021

jameslamb
Contributor

This PR fixes #6560.

Right now, the packages at https://pypi.org/project/xgboost/#files include some files like documentation and tests that I think could be safely removed. This PR proposes making the rules in MANIFEST.in more specific, to reduce the size of distributions of the Python package.

| | master | this PR |
|---|---|---|
| sdist (compressed) | 772K | 664K |
| sdist (uncompressed) | 4.2M | 3.5M |

How I calculated these sizes:

pushd python-package
    rm -rf xgboost.egg-info
    rm -rf dist/
    rm -rf __pycache__
    rm -rf build/

    echo ""
    echo "building source distribution"
    echo ""
    python setup.py sdist > /dev/null
    cp xgboost.egg-info/SOURCES.txt ~/SOURCES.txt
    pushd dist/
        echo ""
        echo "sdist compressed size"
        echo ""
        du -a -h .
        tar -xf xgboost*.tar.gz
        rm xgboost*.tar.gz
        ls .
        echo ""
        echo "sdist uncompressed size"
        echo ""
        du -sh .
    popd
popd
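
For anyone who would rather not unpack the archive, the same two numbers can be approximated from Python. This is only a rough sketch, not part of this PR: `sdist_sizes` is a hypothetical helper name, and it sums the member sizes recorded in the tarball, so the uncompressed figure may differ slightly from what `du` reports (which counts disk blocks).

```python
import tarfile
from pathlib import Path


def sdist_sizes(archive_path):
    """Return (compressed_bytes, uncompressed_bytes) for a .tar.gz sdist.

    Hypothetical helper for illustration only.
    """
    archive = Path(archive_path)
    # Size of the archive file itself, as stored on disk.
    compressed = archive.stat().st_size
    with tarfile.open(archive, mode="r:gz") as tar:
        # Sum the sizes of regular files recorded in the archive headers,
        # without extracting anything.
        uncompressed = sum(m.size for m in tar.getmembers() if m.isfile())
    return compressed, uncompressed
```

After `python setup.py sdist`, calling `sdist_sizes` on the tarball under `python-package/dist/` should give figures in the same range as the table above.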

You can check the list of files to be included by running the following:

cd python-package
python setup.py sdist
cat xgboost.egg-info/SOURCES.txt
Included files as of this PR:
MANIFEST.in
README.rst
setup.cfg
setup.py
xgboost/CMakeLists.txt
xgboost/LICENSE
xgboost/VERSION
xgboost/__init__.py
xgboost/callback.py
xgboost/compat.py
xgboost/config.py
xgboost/core.py
xgboost/dask.py
xgboost/data.py
xgboost/libpath.py
xgboost/plotting.py
xgboost/py.typed
xgboost/rabit.py
xgboost/sklearn.py
xgboost/tracker.py
xgboost/training.py
xgboost.egg-info/PKG-INFO
xgboost.egg-info/SOURCES.txt
xgboost.egg-info/dependency_links.txt
xgboost.egg-info/not-zip-safe
xgboost.egg-info/requires.txt
xgboost.egg-info/top_level.txt
xgboost/cmake/Doc.cmake
xgboost/cmake/FindPrefetchIntrinsics.cmake
xgboost/cmake/Python_version.in
xgboost/cmake/Utils.cmake
xgboost/cmake/Version.cmake
xgboost/cmake/version_config.h.in
xgboost/cmake/xgboost-config.cmake.in
xgboost/cmake/xgboost.pc.in
xgboost/cmake/modules/FindNVML.cmake
xgboost/cmake/modules/FindNVTX.cmake
xgboost/cmake/modules/FindNccl.cmake
xgboost/dmlc-core/CMakeLists.txt
xgboost/dmlc-core/cmake/Utils.cmake
xgboost/dmlc-core/cmake/build_config.h.in
xgboost/dmlc-core/cmake/dmlc-config.cmake.in
xgboost/dmlc-core/cmake/Modules/FindHDFS.cmake
xgboost/dmlc-core/include/dmlc/any.h
xgboost/dmlc-core/include/dmlc/array_view.h
xgboost/dmlc-core/include/dmlc/base.h
xgboost/dmlc-core/include/dmlc/blockingconcurrentqueue.h
xgboost/dmlc-core/include/dmlc/build_config_default.h
xgboost/dmlc-core/include/dmlc/common.h
xgboost/dmlc-core/include/dmlc/concurrency.h
xgboost/dmlc-core/include/dmlc/concurrentqueue.h
xgboost/dmlc-core/include/dmlc/config.h
xgboost/dmlc-core/include/dmlc/data.h
xgboost/dmlc-core/include/dmlc/endian.h
xgboost/dmlc-core/include/dmlc/filesystem.h
xgboost/dmlc-core/include/dmlc/input_split_shuffle.h
xgboost/dmlc-core/include/dmlc/io.h
xgboost/dmlc-core/include/dmlc/json.h
xgboost/dmlc-core/include/dmlc/logging.h
xgboost/dmlc-core/include/dmlc/lua.h
xgboost/dmlc-core/include/dmlc/memory.h
xgboost/dmlc-core/include/dmlc/memory_io.h
xgboost/dmlc-core/include/dmlc/omp.h
xgboost/dmlc-core/include/dmlc/optional.h
xgboost/dmlc-core/include/dmlc/parameter.h
xgboost/dmlc-core/include/dmlc/recordio.h
xgboost/dmlc-core/include/dmlc/registry.h
xgboost/dmlc-core/include/dmlc/serializer.h
xgboost/dmlc-core/include/dmlc/strtonum.h
xgboost/dmlc-core/include/dmlc/thread_group.h
xgboost/dmlc-core/include/dmlc/thread_local.h
xgboost/dmlc-core/include/dmlc/threadediter.h
xgboost/dmlc-core/include/dmlc/timer.h
xgboost/dmlc-core/include/dmlc/type_traits.h
xgboost/dmlc-core/make/config.mk
xgboost/dmlc-core/make/dmlc.mk
xgboost/dmlc-core/src/config.cc
xgboost/dmlc-core/src/data.cc
xgboost/dmlc-core/src/io.cc
xgboost/dmlc-core/src/recordio.cc
xgboost/dmlc-core/src/data/basic_row_iter.h
xgboost/dmlc-core/src/data/csv_parser.h
xgboost/dmlc-core/src/data/disk_row_iter.h
xgboost/dmlc-core/src/data/libfm_parser.h
xgboost/dmlc-core/src/data/libsvm_parser.h
xgboost/dmlc-core/src/data/parser.h
xgboost/dmlc-core/src/data/row_block.h
xgboost/dmlc-core/src/data/text_parser.h
xgboost/dmlc-core/src/io/azure_filesys.cc
xgboost/dmlc-core/src/io/azure_filesys.h
xgboost/dmlc-core/src/io/cached_input_split.h
xgboost/dmlc-core/src/io/filesys.cc
xgboost/dmlc-core/src/io/hdfs_filesys.cc
xgboost/dmlc-core/src/io/hdfs_filesys.h
xgboost/dmlc-core/src/io/indexed_recordio_split.cc
xgboost/dmlc-core/src/io/indexed_recordio_split.h
xgboost/dmlc-core/src/io/input_split_base.cc
xgboost/dmlc-core/src/io/input_split_base.h
xgboost/dmlc-core/src/io/line_split.cc
xgboost/dmlc-core/src/io/line_split.h
xgboost/dmlc-core/src/io/local_filesys.cc
xgboost/dmlc-core/src/io/local_filesys.h
xgboost/dmlc-core/src/io/recordio_split.cc
xgboost/dmlc-core/src/io/recordio_split.h
xgboost/dmlc-core/src/io/s3_filesys.cc
xgboost/dmlc-core/src/io/s3_filesys.h
xgboost/dmlc-core/src/io/single_file_split.h
xgboost/dmlc-core/src/io/single_threaded_input_split.h
xgboost/dmlc-core/src/io/threaded_input_split.h
xgboost/dmlc-core/src/io/uri_spec.h
xgboost/dmlc-core/tracker/dmlc-submit
xgboost/dmlc-core/tracker/dmlc_tracker/__init__.py
xgboost/dmlc-core/tracker/dmlc_tracker/kubernetes.py
xgboost/dmlc-core/tracker/dmlc_tracker/launcher.py
xgboost/dmlc-core/tracker/dmlc_tracker/local.py
xgboost/dmlc-core/tracker/dmlc_tracker/mesos.py
xgboost/dmlc-core/tracker/dmlc_tracker/mpi.py
xgboost/dmlc-core/tracker/dmlc_tracker/opts.py
xgboost/dmlc-core/tracker/dmlc_tracker/sge.py
xgboost/dmlc-core/tracker/dmlc_tracker/slurm.py
xgboost/dmlc-core/tracker/dmlc_tracker/ssh.py
xgboost/dmlc-core/tracker/dmlc_tracker/submit.py
xgboost/dmlc-core/tracker/dmlc_tracker/tracker.py
xgboost/dmlc-core/tracker/dmlc_tracker/util.py
xgboost/dmlc-core/tracker/dmlc_tracker/yarn.py
xgboost/dmlc-core/tracker/yarn/build.bat
xgboost/dmlc-core/tracker/yarn/build.sh
xgboost/dmlc-core/tracker/yarn/pom.xml
xgboost/dmlc-core/tracker/yarn/src/main/java/org/apache/hadoop/yarn/dmlc/ApplicationMaster.java
xgboost/dmlc-core/tracker/yarn/src/main/java/org/apache/hadoop/yarn/dmlc/Client.java
xgboost/dmlc-core/tracker/yarn/src/main/java/org/apache/hadoop/yarn/dmlc/TaskRecord.java
xgboost/dmlc-core/windows/dmlc.sln
xgboost/dmlc-core/windows/dmlc/dmlc.vcxproj
xgboost/include/xgboost/base.h
xgboost/include/xgboost/c_api.h
xgboost/include/xgboost/data.h
xgboost/include/xgboost/feature_map.h
xgboost/include/xgboost/gbm.h
xgboost/include/xgboost/generic_parameters.h
xgboost/include/xgboost/global_config.h
xgboost/include/xgboost/host_device_vector.h
xgboost/include/xgboost/intrusive_ptr.h
xgboost/include/xgboost/json.h
xgboost/include/xgboost/json_io.h
xgboost/include/xgboost/learner.h
xgboost/include/xgboost/linear_updater.h
xgboost/include/xgboost/logging.h
xgboost/include/xgboost/metric.h
xgboost/include/xgboost/model.h
xgboost/include/xgboost/objective.h
xgboost/include/xgboost/parameter.h
xgboost/include/xgboost/predictor.h
xgboost/include/xgboost/span.h
xgboost/include/xgboost/tree_model.h
xgboost/include/xgboost/tree_updater.h
xgboost/include/xgboost/version_config.h
xgboost/plugin/CMakeLists.txt
xgboost/plugin/README.md
xgboost/plugin/dense_parser/dense_libsvm.cc
xgboost/plugin/example/README.md
xgboost/plugin/example/custom_obj.cc
xgboost/plugin/lz4/sparse_page_lz4_format.cc
xgboost/plugin/updater_gpu/README.md
xgboost/plugin/updater_oneapi/README.md
xgboost/plugin/updater_oneapi/predictor_oneapi.cc
xgboost/plugin/updater_oneapi/regression_loss_oneapi.h
xgboost/plugin/updater_oneapi/regression_obj_oneapi.cc
xgboost/rabit/CMakeLists.txt
xgboost/rabit/include/rabit/base.h
xgboost/rabit/include/rabit/c_api.h
xgboost/rabit/include/rabit/rabit.h
xgboost/rabit/include/rabit/serializable.h
xgboost/rabit/include/rabit/internal/engine.h
xgboost/rabit/include/rabit/internal/io.h
xgboost/rabit/include/rabit/internal/rabit-inl.h
xgboost/rabit/include/rabit/internal/socket.h
xgboost/rabit/include/rabit/internal/utils.h
xgboost/rabit/src/allreduce_base.cc
xgboost/rabit/src/allreduce_base.h
xgboost/rabit/src/allreduce_mock.h
xgboost/rabit/src/c_api.cc
xgboost/rabit/src/engine.cc
xgboost/rabit/src/engine_mock.cc
xgboost/rabit/src/engine_mpi.cc
xgboost/src/CMakeLists.txt
xgboost/src/cli_main.cc
xgboost/src/global_config.cc
xgboost/src/learner.cc
xgboost/src/logging.cc
xgboost/src/c_api/c_api.cc
xgboost/src/c_api/c_api.cu
xgboost/src/c_api/c_api_error.cc
xgboost/src/c_api/c_api_error.h
xgboost/src/common/base64.h
xgboost/src/common/bitfield.h
xgboost/src/common/categorical.h
xgboost/src/common/charconv.cc
xgboost/src/common/charconv.h
xgboost/src/common/column_matrix.h
xgboost/src/common/common.cc
xgboost/src/common/common.cu
xgboost/src/common/common.h
xgboost/src/common/compressed_iterator.h
xgboost/src/common/config.h
xgboost/src/common/device_helpers.cu
xgboost/src/common/device_helpers.cuh
xgboost/src/common/group_data.h
xgboost/src/common/hist_util.cc
xgboost/src/common/hist_util.cu
xgboost/src/common/hist_util.cuh
xgboost/src/common/hist_util.h
xgboost/src/common/host_device_vector.cc
xgboost/src/common/host_device_vector.cu
xgboost/src/common/io.cc
xgboost/src/common/io.h
xgboost/src/common/json.cc
xgboost/src/common/math.h
xgboost/src/common/observer.h
xgboost/src/common/probability_distribution.h
xgboost/src/common/quantile.cc
xgboost/src/common/quantile.cu
xgboost/src/common/quantile.cuh
xgboost/src/common/quantile.h
xgboost/src/common/random.cc
xgboost/src/common/random.h
xgboost/src/common/row_set.h
xgboost/src/common/survival_util.cc
xgboost/src/common/survival_util.h
xgboost/src/common/threading_utils.h
xgboost/src/common/timer.cc
xgboost/src/common/timer.h
xgboost/src/common/transform.h
xgboost/src/common/version.cc
xgboost/src/common/version.h
xgboost/src/data/adapter.h
xgboost/src/data/array_interface.h
xgboost/src/data/data.cc
xgboost/src/data/data.cu
xgboost/src/data/device_adapter.cuh
xgboost/src/data/ellpack_page.cc
xgboost/src/data/ellpack_page.cu
xgboost/src/data/ellpack_page.cuh
xgboost/src/data/ellpack_page_raw_format.cu
xgboost/src/data/ellpack_page_source.cc
xgboost/src/data/ellpack_page_source.cu
xgboost/src/data/ellpack_page_source.h
xgboost/src/data/iterative_device_dmatrix.cu
xgboost/src/data/iterative_device_dmatrix.h
xgboost/src/data/proxy_dmatrix.cu
xgboost/src/data/proxy_dmatrix.h
xgboost/src/data/simple_batch_iterator.h
xgboost/src/data/simple_dmatrix.cc
xgboost/src/data/simple_dmatrix.cu
xgboost/src/data/simple_dmatrix.h
xgboost/src/data/sparse_page_dmatrix.cc
xgboost/src/data/sparse_page_dmatrix.h
xgboost/src/data/sparse_page_raw_format.cc
xgboost/src/data/sparse_page_source.cc
xgboost/src/data/sparse_page_source.h
xgboost/src/data/sparse_page_writer.h
xgboost/src/gbm/gblinear.cc
xgboost/src/gbm/gblinear_model.cc
xgboost/src/gbm/gblinear_model.h
xgboost/src/gbm/gbm.cc
xgboost/src/gbm/gbtree.cc
xgboost/src/gbm/gbtree.h
xgboost/src/gbm/gbtree_model.cc
xgboost/src/gbm/gbtree_model.h
xgboost/src/linear/coordinate_common.h
xgboost/src/linear/linear_updater.cc
xgboost/src/linear/param.h
xgboost/src/linear/updater_coordinate.cc
xgboost/src/linear/updater_gpu_coordinate.cu
xgboost/src/linear/updater_shotgun.cc
xgboost/src/metric/elementwise_metric.cc
xgboost/src/metric/elementwise_metric.cu
xgboost/src/metric/metric.cc
xgboost/src/metric/metric_common.h
xgboost/src/metric/multiclass_metric.cc
xgboost/src/metric/multiclass_metric.cu
xgboost/src/metric/rank_metric.cc
xgboost/src/metric/rank_metric.cu
xgboost/src/metric/survival_metric.cc
xgboost/src/metric/survival_metric.cu
xgboost/src/objective/aft_obj.cc
xgboost/src/objective/aft_obj.cu
xgboost/src/objective/hinge.cc
xgboost/src/objective/hinge.cu
xgboost/src/objective/multiclass_obj.cc
xgboost/src/objective/multiclass_obj.cu
xgboost/src/objective/objective.cc
xgboost/src/objective/rank_obj.cc
xgboost/src/objective/rank_obj.cu
xgboost/src/objective/regression_loss.h
xgboost/src/objective/regression_obj.cc
xgboost/src/objective/regression_obj.cu
xgboost/src/predictor/cpu_predictor.cc
xgboost/src/predictor/gpu_predictor.cu
xgboost/src/predictor/predictor.cc
xgboost/src/tree/constraints.cc
xgboost/src/tree/constraints.cu
xgboost/src/tree/constraints.cuh
xgboost/src/tree/constraints.h
xgboost/src/tree/param.cc
xgboost/src/tree/param.h
xgboost/src/tree/split_evaluator.h
xgboost/src/tree/tree_model.cc
xgboost/src/tree/tree_updater.cc
xgboost/src/tree/updater_basemaker-inl.h
xgboost/src/tree/updater_colmaker.cc
xgboost/src/tree/updater_gpu_common.cuh
xgboost/src/tree/updater_gpu_hist.cu
xgboost/src/tree/updater_histmaker.cc
xgboost/src/tree/updater_prune.cc
xgboost/src/tree/updater_quantile_hist.cc
xgboost/src/tree/updater_quantile_hist.h
xgboost/src/tree/updater_refresh.cc
xgboost/src/tree/updater_sync.cc
xgboost/src/tree/gpu_hist/driver.cuh
xgboost/src/tree/gpu_hist/evaluate_splits.cu
xgboost/src/tree/gpu_hist/evaluate_splits.cuh
xgboost/src/tree/gpu_hist/feature_groups.cu
xgboost/src/tree/gpu_hist/feature_groups.cuh
xgboost/src/tree/gpu_hist/gradient_based_sampler.cu
xgboost/src/tree/gpu_hist/gradient_based_sampler.cuh
xgboost/src/tree/gpu_hist/histogram.cu
xgboost/src/tree/gpu_hist/histogram.cuh
xgboost/src/tree/gpu_hist/row_partitioner.cu
xgboost/src/tree/gpu_hist/row_partitioner.cuh
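
A listing this long is hard to scan, so it can help to group it by directory when checking what a `MANIFEST.in` change actually pulls in. The following is a minimal sketch under my own assumptions: `count_by_directory` is a hypothetical helper, and it only assumes that `SOURCES.txt` holds one `/`-separated path per line, as in the listing above.

```python
from collections import Counter


def count_by_directory(lines, depth=2):
    """Count listed files per directory prefix, truncated to `depth` components.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in lines:
        path = line.strip()
        if not path:
            continue  # skip blank lines
        directories = path.split("/")[:-1]  # drop the file name
        key = "/".join(directories[:depth]) or "."  # "." marks top-level files
        counts[key] += 1
    return counts
```

Passing it the lines of `xgboost.egg-info/SOURCES.txt` and printing `counts.most_common()` shows which directories contribute the most files, which makes leftover documentation or test directories easier to spot.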

How this improves xgboost

This is valuable for storage-sensitive environments, like AWS Lambda. See my comment at pandas-dev/pandas#30741 (comment) for more explanation of that.

Reducing the package size can also help people who have slow download speeds, which I think is even more important now than it was in the past because the new pip resolver downloads source distributions (possibly many versions for one package) while resolving dependencies (pypa/pip#9187).

Thanks for your time and consideration!

codecov-io commented Jan 2, 2021

Codecov Report

Merging #6565 (9d6aa82) into master (fa13992) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #6565   +/-   ##
=======================================
  Coverage   80.20%   80.20%           
=======================================
  Files          13       13           
  Lines        3591     3591           
=======================================
  Hits         2880     2880           
  Misses        711      711           

trivialfis merged commit 195a41c into dmlc:master on Jan 2, 2021
jameslamb deleted the fix/sdist-size branch on January 2, 2021 19:40

Successfully merging this pull request may close these issues.

[python-package] cut unnecessary files out of sdist package