[Python] usability improvements for a "minimal" pyarrow #38536

Open
h-vetinari opened this issue Nov 1, 2023 · 2 comments

Comments

@h-vetinari
Contributor

Describe the enhancement requested

Providing slimmer variants of arrow has been a topic for quite a while, but it became more urgent with pandas' plan to depend on pyarrow, which would bring a substantial increase in installation size because of the way pyarrow gets packaged. This is even more true in conda-forge, where we package a "maximal" version of arrow (since it's so hard to build from source) that generally pulls in more transitive dependencies than the wheels do.

Through work on the feedstock, the conda-forge side of arrow is now ready to split up libarrow 14.0 into several pieces (currently libarrow-{acero,dataset,flight,flight-sql,gandiva,substrait} + libparquet), but pyarrow still depends on the entirety of libarrow, not least because the python bindings link directly to everything except libarrow-flight-sql:

INFO (pyarrow,lib/python3.12/site-packages/pyarrow/_acero.cpython-312-x86_64-linux-gnu.so):           Needed DSO lib/libarrow_acero.so.1400     found in libarrow-acero-14.0.0-h59595ed_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/_dataset.cpython-312-x86_64-linux-gnu.so):         Needed DSO lib/libarrow_dataset.so.1400   found in libarrow-dataset-14.0.0-h59595ed_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/_dataset_orc.cpython-312-x86_64-linux-gnu.so):     Needed DSO lib/libarrow_dataset.so.1400   found in libarrow-dataset-14.0.0-h59595ed_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/_dataset_parquet.cpython-312-x86_64-linux-gnu.so): Needed DSO lib/libarrow_dataset.so.1400   found in libarrow-dataset-14.0.0-h59595ed_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/_flight.cpython-312-x86_64-linux-gnu.so):          Needed DSO lib/libarrow_flight.so.1400    found in libarrow-flight-14.0.0-h35bba4a_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/_substrait.cpython-312-x86_64-linux-gnu.so):       Needed DSO lib/libarrow_substrait.so.1400 found in libarrow-substrait-14.0.0-hab2db56_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/gandiva.cpython-312-x86_64-linux-gnu.so):          Needed DSO lib/libgandiva.so.1400         found in libarrow-gandiva-14.0.0-hacb8726_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/libarrow_python_flight.so):                        Needed DSO lib/libarrow_flight.so.1400    found in libarrow-flight-14.0.0-h35bba4a_0_cpu
INFO (pyarrow,lib/python3.12/site-packages/pyarrow/libarrow_python.so):                               Needed DSO lib/libparquet.so.1400         found in libparquet-14.0.0-h352af49_0_cpu
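
To see at a glance which pyarrow files pull in which package, the linkage log above can be grouped by providing package. This is only a rough sketch: the function name `modules_by_package` and the regular expression are illustrative assumptions based on the line format shown here, not part of any conda-build or pyarrow API.

```python
import re
from collections import defaultdict

# Assumed line format (taken from the log above):
#   INFO (pyarrow,<path>): Needed DSO <lib> found in <package>
_LINE = re.compile(
    r"INFO \(pyarrow,(?P<module>[^)]+)\):\s+"
    r"Needed DSO (?P<dso>\S+)\s+found in (?P<package>\S+)"
)


def modules_by_package(log_text):
    """Map each providing package to the pyarrow files that link against it."""
    result = defaultdict(list)
    for line in log_text.splitlines():
        match = _LINE.search(line)
        if match:
            # Keep only the file name, not the full site-packages path.
            filename = match.group("module").rsplit("/", 1)[-1]
            result[match.group("package")].append(filename)
    return dict(result)
```

Applied to the log above, this would list three `_dataset*` extension modules under the libarrow-dataset package and `libarrow_python.so` under libparquet, which is what makes a pyarrow that depends only on the core libarrow nontrivial.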

While it would be theoretically possible to also build various pyarrow-* variants, that's quite unappealing IMO from a packaging perspective, and it would be nicer if pyarrow just depended on the (core) libarrow, but provided helpful error messages where any missing libarrow-* libraries actually get used. In such a scenario (cf. discussion in conda-forge/arrow-cpp-feedstock#1035),

@h-vetinari: [...] I think in terms of user-friendliness, we need to provide a better message than:

ImportError: libarrow_dataset.so.1400: cannot open shared object file: No such file or directory

I think it would make sense for arrow to define which libraries can be removed while still expecting core functionality to work, which dependencies each remaining artefact has, and provide some error message upon not finding the respective library (which we can then patch on the feedstock to add messages like "install this additional package to get it"; alternatively, arrow could of course integrate that into the messages directly in this repo, à la "if you're using arrow from conda-forge, just install libarrow-dataset").
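
As a rough illustration of the kind of message being asked for, the sketch below wraps an import and appends an install hint when the `ImportError` names a known shared library. Everything here is hypothetical: `_DSO_TO_PACKAGE`, `hint_for_missing_dso` and `import_optional` are made-up names, the mapping is inferred from the linkage log earlier in this issue, and this is not how pyarrow actually handles missing components.

```python
import importlib

# Hypothetical mapping from a shared library's base name to the conda-forge
# package that provides it (inferred from the linkage log in this issue).
_DSO_TO_PACKAGE = {
    "libarrow_acero": "libarrow-acero",
    "libarrow_dataset": "libarrow-dataset",
    "libarrow_flight": "libarrow-flight",
    "libarrow_substrait": "libarrow-substrait",
    "libgandiva": "libarrow-gandiva",
    "libparquet": "libparquet",
}


def hint_for_missing_dso(message):
    """Return an install hint if `message` names a known library, else None."""
    for dso, package in _DSO_TO_PACKAGE.items():
        # The trailing dot keeps "libarrow_flight" from matching
        # "libarrow_flight_sql.so".
        if dso + "." in message:
            return f"{dso} is missing; installing '{package}' should provide it."
    return None


def import_optional(module_name):
    """Import a module, adding an install hint to the ImportError if possible."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        hint = hint_for_missing_dso(str(exc))
        if hint is None:
            raise  # unknown failure: leave the original error untouched
        raise ImportError(f"{exc}\n{hint}") from exc
```

With a helper of this shape, the hint text lives in a single place, so a downstream packager would only need to patch one table or one format string.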

Such an approach would presumably also make things easier on the wheel side (i.e. not having N pyarrow-* variants), though of course, providing the equivalent of the libarrow-* outputs from conda-forge through wheels would be quite a headache. It's possible that the best solution for wheels looks different (or ends up being sliced differently, e.g. having two wheels pyarrow and pyarrow-minimal, or pyarrow and pyarrow[full]).

Note also that (from conda-forge/arrow-cpp-feedstock#1175):

@h-vetinari: [...] the new libarrow core library still depends on some of the most heavy-weight libraries at runtime (e.g. libgoogle-cloud, which is around 30MB). I think it would make sense to separate out the pieces that depend on cloud-provider bindings into a separate output. Not sure how much work that is...

This is now being tracked in #38309.

Component(s)

Packaging, Python

@jorisvandenbossche
Member

it would be nicer if pyarrow just depended on the (core) libarrow, but provided helpful error messages where any missing libarrow-* libraries actually get used

That should already have been tackled (although there might be some libs that were missed): #36553 / #36554.
Of course, that's still a generic message, while in theory for conda-forge we could give a more useful error message pointing to the additional package they have to install. But I am not sure that's something we want to bake into pyarrow.

@h-vetinari
Contributor Author

Ah, nice! I guess we can patch that, though it would be nice to have a template that we only need to patch once [1], rather than having to chase every import that's potentially affected by a missing DSO.

Footnotes

  1. assuming pyarrow would not be willing to detect e.g. that the interpreter is from conda, and adapt the message accordingly
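
For reference, detecting a conda interpreter can be approximated by checking for a `conda-meta` directory under `sys.prefix`, which is a common heuristic rather than an official API. The sketch below is hypothetical: the function names and the message wording are assumptions, and nothing here reflects a decision by pyarrow to adopt such a check.

```python
import os
import sys


def is_conda_environment(prefix=None):
    """Heuristic: conda environments contain a `conda-meta` directory."""
    prefix = sys.prefix if prefix is None else prefix
    return os.path.isdir(os.path.join(prefix, "conda-meta"))


def install_hint(package, prefix=None):
    """Build an environment-specific hint (hypothetical wording)."""
    if is_conda_environment(prefix):
        return f"If you are using arrow from conda-forge, install '{package}'."
    return f"The optional component provided by '{package}' is not installed."
```

The `prefix` parameter exists only to make the check testable; real callers would rely on the default `sys.prefix`.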
