Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-5082: [Python] Substantially reduce Python wheel package and install size #7334

Closed
wants to merge 12 commits into from

Conversation

wesm
Copy link
Member

@wesm wesm commented Jun 2, 2020

Current manylinux wheel packages on master:

  • .whl manylinux1 package is 61 MB
  • Installed size is 223 MB

This patch

  • .whl package is 15 MB
  • Installed size is 57 MB

That's more than a 4x size reduction. There's several things in this patch:

  • We no longer ship 2 copies of shared libraries in the wheels. We ship just the SO-versioned shared libraries now. Because this creates problems for linkers (-larrow -lparquet etc won't work as is), I added a function pyarrow.create_library_symlinks() that adds symlinks to the versioned shared libraries. This function has to be run once under user permissions that can write to the site-packages/pyarrow/ directory.
  • Gandiva is disabled. If we're going to ship Gandiva as a wheel, I think we should do it as an add-on pyarrow_gandiva package per ARROW-8518.
  • Environment variable PYARROW_INSTALL_TESTS added to not install pyarrow.tests, which is about 2.3MB uncompressed. I don't think we need to ship the tests in the wheels. Currently we are still shipping the tests.
  • Compiled Cython C++ source files are no longer included in the wheels.

@wesm
Copy link
Member Author

wesm commented Jun 2, 2020

@kszucs can you help me get this across the finish line?

@wesm wesm requested review from xhochy and pitrou June 2, 2020 22:39
@github-actions
Copy link

github-actions bot commented Jun 2, 2020

@pitrou
Copy link
Member

pitrou commented Jun 3, 2020

I added function that tries to create the necessary symlinks when you call pyarrow.get_library_dirs()

It's a bad idea to add side effects to a simple inquiry function. This should IMHO be in a separate function (e.g. create_library_symlinks).

I don't think we need to ship the tests in the wheels.

I'd like to keep them. It's helpful to test if an installation works correctly.

@wesm
Copy link
Member Author

wesm commented Jun 3, 2020

I'd like to keep them. It's helpful to test if an installation works correctly.

Couldn't we solve this problem another way? If a user wants to run the whole test suite locally they need more than just the pyarrow/tests directory. I don't think it's worth bloating the installs for an infrequent use case.

@kszucs
Copy link
Member

kszucs commented Jun 3, 2020

At first I didn't like that the tests are shipped with the packages, but later on I found it useful. It also worth mentioning that many of our packaging builds and CI tests run the pyarrow unittests using pytest --pyargs pyarrow after installation.

If we decide to remove the tests from the packages (I'd rather keep them though) please defer it to another pull request because we need to update more CI scripts.

@kszucs
Copy link
Member

kszucs commented Jun 3, 2020

I added function that tries to create the necessary symlinks when you call pyarrow.get_library_dirs()

It's a bad idea to add side effects to a simple inquiry function. This should IMHO be in a separate function (e.g. create_library_symlinks).

Agree with Antoine.

@kszucs
Copy link
Member

kszucs commented Jun 3, 2020

@kszucs can you help me get this across the finish line?

Yes, I'll ensure that the wheel packaging builds work properly.

@kszucs
Copy link
Member

kszucs commented Jun 3, 2020

@github-actions crossbow submit -g wheel

@github-actions
Copy link

github-actions bot commented Jun 3, 2020

Revision: 208bd8c

Submitted crossbow builds: ursa-labs/crossbow @ actions-283

Task Status
wheel-manylinux1-cp35m Azure
wheel-manylinux1-cp36m Azure
wheel-manylinux1-cp37m Azure
wheel-manylinux1-cp38 Azure
wheel-manylinux2010-cp35m Azure
wheel-manylinux2010-cp36m Azure
wheel-manylinux2010-cp37m Azure
wheel-manylinux2010-cp38 Azure
wheel-manylinux2014-cp35m Azure
wheel-manylinux2014-cp36m Azure
wheel-manylinux2014-cp37m Azure
wheel-manylinux2014-cp38 Azure
wheel-osx-cp35m TravisCI
wheel-osx-cp36m TravisCI
wheel-osx-cp37m TravisCI
wheel-osx-cp38 TravisCI
wheel-win-cp35m Appveyor
wheel-win-cp36m Appveyor
wheel-win-cp37m Appveyor
wheel-win-cp38 Appveyor

@kszucs
Copy link
Member

kszucs commented Jun 3, 2020

@github-actions crossbow submit -g wheel

@github-actions
Copy link

github-actions bot commented Jun 3, 2020

Revision: 2d89b44

Submitted crossbow builds: ursa-labs/crossbow @ actions-284

Task Status
wheel-manylinux1-cp35m Azure
wheel-manylinux1-cp36m Azure
wheel-manylinux1-cp37m Azure
wheel-manylinux1-cp38 Azure
wheel-manylinux2010-cp35m Azure
wheel-manylinux2010-cp36m Azure
wheel-manylinux2010-cp37m Azure
wheel-manylinux2010-cp38 Azure
wheel-manylinux2014-cp35m Azure
wheel-manylinux2014-cp36m Azure
wheel-manylinux2014-cp37m Azure
wheel-manylinux2014-cp38 Azure
wheel-osx-cp35m TravisCI
wheel-osx-cp36m TravisCI
wheel-osx-cp37m TravisCI
wheel-osx-cp38 TravisCI
wheel-win-cp35m Appveyor
wheel-win-cp36m Appveyor
wheel-win-cp37m Appveyor
wheel-win-cp38 Appveyor

@wesm
Copy link
Member Author

wesm commented Jun 3, 2020

from the packages (I'd rather keep them though) please defer it to another pull request because we need to update more CI scripts.

I'm fine with debating whether to ship the tests separately.

The people who benefit from being able to do pytest --pyargs pyarrow are primarily the developers (i.e. us), not the users. If we want to enable users to run the test suite locally on their machine, we should IMHO instead implement a function that downloads the tests along with the test dependencies (e.g. the testing data repos) and then executes them. If say < 0.1% of users ever need to do this, why should > 99.9% of users bear the burden?

@kszucs
Copy link
Member

kszucs commented Jun 3, 2020

I'm fine with debating whether to ship the tests separately.

The people who benefit from being able to do pytest --pyargs pyarrow are primarily the developers (i.e. us), not the users. If we want to enable users to run the test suite locally on their machine, we should IMHO instead implement a function that downloads the tests along with the test dependencies (e.g. the testing data repos) and then executes them. If say < 0.1% of users ever need to do this, why should > 99.9% of users bear the burden?

Hard to disagree with that argument :)

Either way I'm deferring it to a follow-up because it'll involve quite some CI and packaging updates.

@pitrou
Copy link
Member

pitrou commented Jun 3, 2020

If say < 0.1% of users ever need to do this, why should > 99.9% of users bear the burden?

It's between the burden for the user of two additional megabytes installed, vs. the burden for us of "implement [and maintain] a function that downloads the tests along with the test dependencies (e.g. the testing data repos) and then executes them".

For me it's a no-brainer to ship the tests with the wheels, but YMMV.

@wesm
Copy link
Member Author

wesm commented Jun 3, 2020

It's between the burden for the user of two additional megabytes installed, vs. the burden for us of

Well, the point is that we don't have to do it at all. At no point in the 52 months since Apache Arrow started do I recall a user running the test suite out of a wheel or asking about doing so. If this is truly something that people need to be able to do, maybe an interested party can contribute it to the project? We've already expressed that we are going to limit our investment of time in maintaining wheels, and, from what I can tell, smaller wheels -> fewer complaints (here a 3% savings isn't that compelling but given that people are trying to squeeze pyarrow into deployments in AWS lambda that have to be < 250 MB including all dependencies, every megabyte does indeed count).

OR (ARROW_PARQUET
AND CMAKE_CXX_COMPILER_ID STREQUAL "GNU"
AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS "4.9"))
OR ARROW_PARQUET)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In file included from /usr/local/include/thrift/TApplicationException.h:23,
                 from /arrow/cpp/src/parquet/thrift_internal.h:36,
                 from /arrow/cpp/src/parquet/column_reader.cc:47:
/usr/local/include/thrift/Thrift.h:45:10: fatal error: boost/utility/enable_if.hpp: No such file or directory
 #include <boost/utility/enable_if.hpp>

Seems like the parquet headers require boost headers as a transitive dependency.

@kszucs
Copy link
Member

kszucs commented Jun 3, 2020

@github-actions crossbow submit -g wheel

@github-actions
Copy link

github-actions bot commented Jun 3, 2020

Revision: 00d9c86

Submitted crossbow builds: ursa-labs/crossbow @ actions-285

Task Status
wheel-manylinux1-cp35m Azure
wheel-manylinux1-cp36m Azure
wheel-manylinux1-cp37m Azure
wheel-manylinux1-cp38 Azure
wheel-manylinux2010-cp35m Azure
wheel-manylinux2010-cp36m Azure
wheel-manylinux2010-cp37m Azure
wheel-manylinux2010-cp38 Azure
wheel-manylinux2014-cp35m Azure
wheel-manylinux2014-cp36m Azure
wheel-manylinux2014-cp37m Azure
wheel-manylinux2014-cp38 Azure
wheel-osx-cp35m TravisCI
wheel-osx-cp36m TravisCI
wheel-osx-cp37m TravisCI
wheel-osx-cp38 TravisCI
wheel-win-cp35m Appveyor
wheel-win-cp36m Appveyor
wheel-win-cp37m Appveyor
wheel-win-cp38 Appveyor

@kszucs
Copy link
Member

kszucs commented Jun 3, 2020

@github-actions crossbow submit wheel-osx-*

@github-actions
Copy link

github-actions bot commented Jun 3, 2020

Revision: 8dacdc9

Submitted crossbow builds: ursa-labs/crossbow @ actions-286

Task Status
wheel-osx-cp35m TravisCI
wheel-osx-cp36m TravisCI
wheel-osx-cp37m TravisCI
wheel-osx-cp38 TravisCI

@pitrou
Copy link
Member

pitrou commented Jun 3, 2020

Well, I won't argue too much about it. But at some point we had decided that wheels were too much of a burden for us, and now it seems we're going out of our way to please people. I'm not sure I understand the strategy.

@kszucs
Copy link
Member

kszucs commented Jun 3, 2020

@github-actions crossbow submit -g wheel

@github-actions
Copy link

github-actions bot commented Jun 3, 2020

Revision: 271b25b

Submitted crossbow builds: ursa-labs/crossbow @ actions-287

Task Status
wheel-manylinux1-cp35m Azure
wheel-manylinux1-cp36m Azure
wheel-manylinux1-cp37m Azure
wheel-manylinux1-cp38 Azure
wheel-manylinux2010-cp35m Azure
wheel-manylinux2010-cp36m Azure
wheel-manylinux2010-cp37m Azure
wheel-manylinux2010-cp38 Azure
wheel-manylinux2014-cp35m Azure
wheel-manylinux2014-cp36m Azure
wheel-manylinux2014-cp37m Azure
wheel-manylinux2014-cp38 Azure
wheel-osx-cp35m TravisCI
wheel-osx-cp36m TravisCI
wheel-osx-cp37m TravisCI
wheel-osx-cp38 TravisCI
wheel-win-cp35m Appveyor
wheel-win-cp36m Appveyor
wheel-win-cp37m Appveyor
wheel-win-cp38 Appveyor

@wesm
Copy link
Member Author

wesm commented Jun 3, 2020

Well, I won't argue too much about it. But at some point we had decided that wheels were too much of a burden for us, and now it seems we're going out of our way to please people. I'm not sure I understand the strategy.

Well, the wheels are being installed at least 6.5 million times per month (for point of reference, for pandas it's 22.7M) and so wheel use has an impact on the health and success of the open source project. My attitude is that we shouldn't feel too bad about "taking things away" from the wheels absent more enthusiastic maintainers. With a half day's labor I was able to shrink the wheels by 4x -- the tests / no tests thing wasn't the most significant change but I definitely don't want to be going out of our way to put things in the wheels or maintain special code to cater to wheel users under the present circumstances. I think at least this will stymie some of the pain for some period of time until perhaps more maintainers come out of the woodwork (or I can afford to recruit and hire them).

@wesm
Copy link
Member Author

wesm commented Jun 3, 2020

I made the pa.create_library_symlinks() an explicit opt-in. I'm wondering if we should add a test to make sure that building a C extension against the installed wheel works so this doesn't get broken in the future

@pitrou
Copy link
Member

pitrou commented Jun 3, 2020

I would say that whoever depends on that may want to add the required unit test.

@wesm
Copy link
Member Author

wesm commented Jun 3, 2020

Sounds fine to me, we can open a JIRA.

EDIT: https://issues.apache.org/jira/browse/ARROW-9033

@xhochy
Copy link
Member

xhochy commented Jun 4, 2020

I would say that whoever depends on that may want to add the required unit test.

FYI @fjetter

@kszucs
Copy link
Member

kszucs commented Jun 4, 2020

@github-actions crossbow submit -g wheel

@github-actions
Copy link

github-actions bot commented Jun 4, 2020

Revision: 599660e

Submitted crossbow builds: ursa-labs/crossbow @ actions-288

Task Status
wheel-manylinux1-cp35m Azure
wheel-manylinux1-cp36m Azure
wheel-manylinux1-cp37m Azure
wheel-manylinux1-cp38 Azure
wheel-manylinux2010-cp35m Azure
wheel-manylinux2010-cp36m Azure
wheel-manylinux2010-cp37m Azure
wheel-manylinux2010-cp38 Azure
wheel-manylinux2014-cp35m Azure
wheel-manylinux2014-cp36m Azure
wheel-manylinux2014-cp37m Azure
wheel-manylinux2014-cp38 Azure
wheel-osx-cp35m TravisCI
wheel-osx-cp36m TravisCI
wheel-osx-cp37m TravisCI
wheel-osx-cp38 TravisCI
wheel-win-cp35m Appveyor
wheel-win-cp36m Appveyor
wheel-win-cp37m Appveyor
wheel-win-cp38 Appveyor

@kszucs
Copy link
Member

kszucs commented Jun 4, 2020

@github-actions crossbow submit wheel-manylinux*

@github-actions
Copy link

github-actions bot commented Jun 4, 2020

Revision: 83b0c5f

Submitted crossbow builds: ursa-labs/crossbow @ actions-289

Task Status
wheel-manylinux1-cp35m Azure
wheel-manylinux1-cp36m Azure
wheel-manylinux1-cp37m Azure
wheel-manylinux1-cp38 Azure
wheel-manylinux2010-cp35m Azure
wheel-manylinux2010-cp36m Azure
wheel-manylinux2010-cp37m Azure
wheel-manylinux2010-cp38 Azure
wheel-manylinux2014-cp35m Azure
wheel-manylinux2014-cp36m Azure
wheel-manylinux2014-cp37m Azure
wheel-manylinux2014-cp38 Azure

@kszucs
Copy link
Member

kszucs commented Jun 5, 2020

@xhochy could you please give it a review?

%PYTHON_INTERPRETER% -c "import pyarrow.dataset" || exit /B

@rem %PYTHON_INTERPRETER% -c "import pyarrow.gandiva" || exit /B
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just delete this line

@wesm
Copy link
Member Author

wesm commented Jun 5, 2020

Merging. Thanks all.

@wesm wesm closed this in 120c21f Jun 5, 2020
@wesm wesm deleted the smaller-wheels-try-3 branch June 5, 2020 23:42
@francisco-hoo
Copy link

How can I download this ~15MB .whl version of PyArrow?
Just find the pyarrow-0.17.1-cp38-cp38-manylinux2014_x86_64.whl (63.8 MB) version.
Due to AWS Lambda-Layers limitations, this pull is essential... Thanks all!!

@pitrou
Copy link
Member

pitrou commented Jul 13, 2020

@francisco-hoo For now this is only available in nightly builds:
https://arrow.apache.org/docs/python/install.html#installing-nightly-packages

Soon we will release a 1.0.0 with those improvements included, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants