Create subsets of pyarrow package with pyarrow-core < pyarrow < pyarrow-all and update to Arrow v16.0.0 #1255

Closed

Conversation

raulcd
Member

@raulcd commented Nov 27, 2023

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

As discussed in #1201 this PR tries to add a new pyarrow-base that only depends on libarrow and libparquet.

@conda-forge-webservices

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@h-vetinari
Member

As discussed in #1201 this PR tries to add a new pyarrow-base that only depends on libarrow and libparquet.

I had thought the idea was to just let pyarrow itself depend on libarrow and libparquet, and provide clear instructions in the error path about how to install the other libraries if necessary (e.g. by installing libarrow-X or libarrow-all)?

@raulcd
Member Author

raulcd commented Nov 27, 2023

I had thought the idea was to just let pyarrow itself depend on libarrow and libparquet, and provide clear instructions in the error path about how to install the other libraries if necessary (e.g. by installing libarrow-X or libarrow-all)?

From the discussions and the conversations I've had with @jorisvandenbossche I thought we wanted to provide both pyarrow-base and pyarrow.

Currently, the error on pyarrow-base if we try to import, e.g., pyarrow.dataset is:

import: 'pyarrow.dataset'
Traceback (most recent call last):
  File "/home/conda/feedstock_root/build_artifacts/apache-arrow_1701073557860/test_tmp/run_test.py", line 5, in <module>
    import pyarrow.dataset
  File "/home/conda/feedstock_root/build_artifacts/apache-arrow_1701073557860/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.10/site-packages/pyarrow/dataset.py", line 58, in <module>
    raise ImportError(
ImportError: The pyarrow installation is not built with support for 'dataset' (libarrow_dataset.so.1400: cannot open shared object file: No such file or directory)
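The guard that produces this error can be sketched generically as follows (a simplified illustration of the pattern used in pyarrow/dataset.py; the helper name `import_optional` and the message wording are hypothetical, not pyarrow's actual API):

```python
import importlib


def import_optional(module_name, hint=""):
    """Import an optional submodule, re-raising with a descriptive
    error that names the missing feature and how to get it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"The installation is not built with support for "
            f"'{module_name}' ({exc}). {hint}"
        ) from None
```

With this pattern the low-level loader error (e.g. the missing libarrow_dataset.so) stays visible inside the message, while the chained traceback is suppressed via `from None`.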

I see your point and I am OK with both approaches; I would probably agree that 2 is the better approach going forward:

  1. If we provide both pyarrow-base and pyarrow, we are less disruptive for some of our current users who depend on "extra" libarrow packages.
  2. If we provide only pyarrow, depending only on libarrow and libparquet, we have a much cleaner installation going forward, at the cost of some disruption to current users.

@jorisvandenbossche what are your thoughts?

Review comments on recipe/build-pyarrow.bat and recipe/build-pyarrow.sh (outdated, resolved)
@jorisvandenbossche
Member

I haven't yet looked in detail, but some quick drive-by thoughts on the package naming:

  • If we want to make (long term, e.g. also when filesystems can be split off) the most minimal Arrow installation possible, shouldn't we also remove the libparquet dependency from the minimal package (whether it is pyarrow or pyarrow-base)?
  • If we have a general package that most users will install as the default, and where we want to include parquet, I would personally also include datasets (many features of pyarrow.parquet.read_table(..) actually rely on dataset, like reading a partitioned dataset).
  • I think one advantage of keeping a "pyarrow" package that is not the most minimal package is that this is less disruptive now (as mentioned above), but also later if we want to further remove things from "libarrow"/"pyarrow-base".
    For example, when splitting off the compute kernels, that would again be a breaking change if "pyarrow" then no longer provided them. Or, when splitting off the filesystems, that would break users assuming "pyarrow" comes with S3 support (although in that case there is something to be said for this not being part of base "pyarrow", because then why S3 and not GC or Azure? We might exactly want to avoid bundling them all in the default install, given you almost never need them all).

A potential alternative is something in between:

  • a pyarrow-base that is the absolute minimum (whatever minimum is possible at the moment), and people who rely on this package will more often have to update in case this minimum changes
  • a pyarrow package that includes some optional components to provide what we consider as a good "default" experience. It's of course debatable what this would be (and also this could change over time), but for example right now I would include parquet and datasets, but not gandiva and flight (because those are used much less, I think, and also have more complex dependencies). And for example when splitting off compute at some point, that could be something that keeps being included in this default pyarrow package.
  • when someone then wants the features of truly optional pieces like gandiva or flight (and potentially the filesystems in the future), at that point they definitely need to install the additional package explicitly.
    One question I am wondering here is whether we also want to provide the names "pyarrow-gandiva" and "pyarrow-flight" as aliases of "libarrow-gandiva" etc to make it more user friendly for python users.
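Such a Python-facing alias could be expressed as a thin metapackage output in the recipe. A sketch only (the output name and the pinning choice are assumptions, not the actual meta.yaml):

```yaml
outputs:
  - name: pyarrow-flight
    requirements:
      run:
        - {{ pin_subpackage("pyarrow-core", exact=True) }}
        - libarrow-flight
```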

@raulcd
Member Author

raulcd commented Nov 30, 2023

If we want to make (long term, eg also when filesystems can be split off) the most minimal arrow installation possible, shouldn't we also remove the libparquet dependency from the minimal package (whether it is pyarrow or pyarrow-base)?

Currently we can't import pyarrow without libparquet:

import: 'pyarrow'
Traceback (most recent call last):
  File "/home/conda/feedstock_root/build_artifacts/apache-arrow_1701339661261/test_tmp/run_test.py", line 2, in <module>
    import pyarrow
  File "/home/conda/feedstock_root/build_artifacts/apache-arrow_1701339661261/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.9/site-packages/pyarrow/__init__.py", line 65, in <module>
    import pyarrow.lib as _lib
ImportError: libparquet.so.1400: cannot open shared object file: No such file or directory

I've opened apache/arrow#39006

@h-vetinari
Member

some quick drive-by thoughts on the package naming:

I think some degree of breakage is unavoidable if we want to introduce a minimal pyarrow and default to it at some point. The question is whether the error messages that people encounter would be self-explanatory enough to trivially fix things.

A potential alternative is something in between:

If we really want not just pyarrow-base but also a pyarrow that doesn't contain absolutely everything, then we should have pyarrow-all (which would +/- match libarrow-all). I'm not 100% excited about this, because -base packages are often packaging work-arounds (like breaking a cyclic build dependence), and not meant to be installed directly.

It wouldn't be the first package with such a setup though: for example, ray has ray-core & ray-all as the smallest/largest collections, ray-default in between (ray was taken), and ray-* for various flavours. Similarly for modin, which has modin-core < modin < modin-all (plus some modin-*).

So I guess it'd be reasonable to do pyarrow-core < pyarrow < pyarrow-all. Not sure how you'll match that with the PyPI side, but from the POV of conda, we have a lot of design flexibility in this regard.

@ZupoLlask

Hi guys!

So how's this progressing? 👍

@raulcd
Member Author

raulcd commented Jan 15, 2024

So how's this progressing? 👍

I am currently working on releasing Arrow 15.0.0, which includes a fix that makes it possible to drop libparquet.so from the minimal pyarrow. Once Arrow v15 is released, I plan to rework the PR to generate pyarrow-core < pyarrow < pyarrow-all as discussed above. I can probably start working on this in the next couple of weeks.

@conda-forge-webservices

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe) and found some lint.

Here's what I've got...

For recipe:

  • Selectors are suggested to take a <two spaces>#<one space>[<expression>] form. See lines [709]

@conda-forge-webservices

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@raulcd
Member Author

raulcd commented Feb 1, 2024

@jorisvandenbossche @h-vetinari I have rebased this PR with 15.0.0, so libparquet is not required for pyarrow-core anymore and it now really depends only on libarrow.
At the moment I have created pyarrow-core < pyarrow < pyarrow-all.

  • pyarrow-core only contains libarrow as a run dependency.
  • pyarrow contains libarrow, libparquet, libarrow-acero, libarrow-dataset and libarrow-substrait (I am not entirely sure if we should also remove libarrow-substrait)
  • pyarrow-all contains, in addition to the pyarrow ones, libarrow-flight, libarrow-flight-sql and libgandiva as run dependencies.
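Expressed as recipe outputs, the layering described above would look roughly like this (an illustrative sketch only; the actual meta.yaml in this PR differs in detail):

```yaml
outputs:
  - name: pyarrow-core
    requirements:
      run:
        - libarrow
  - name: pyarrow
    requirements:
      run:
        - {{ pin_subpackage("pyarrow-core", exact=True) }}
        - libparquet
        - libarrow-acero
        - libarrow-dataset
        - libarrow-substrait
  - name: pyarrow-all
    requirements:
      run:
        - {{ pin_subpackage("pyarrow", exact=True) }}
        - libarrow-flight
        - libarrow-flight-sql
        - libgandiva
```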

I think this is good for a review at the moment.
Thanks!

@raulcd changed the title from Add pyarrow-base to Create subsets of pyarrow package with pyarrow-base < pyarrow < pyarrow-all on Feb 1, 2024
@raulcd changed the title from Create subsets of pyarrow package with pyarrow-base < pyarrow < pyarrow-all to Create subsets of pyarrow package with pyarrow-core < pyarrow < pyarrow-all on Apr 4, 2024
@ianmcook

ianmcook commented Apr 4, 2024

@jorisvandenbossche @raulcd I think we need to have a discussion about this on the Arrow mailing lists before we push this change. The Arrow community currently has little awareness that this change is planned. I worry that there could be much surprise and frustration if we do not inform the community in advance.

@raulcd
Member Author

raulcd commented Apr 5, 2024

@ianmcook I think this is a valid concern, but the only affected users will be those using flight, flight-sql or gandiva with pyarrow. When importing those modules they will get a message like:

The pyarrow installation is not built with support for 'flight'

We could add a patch like:

diff --git a/python/pyarrow/flight.py b/python/pyarrow/flight.py
index b183690..12eef59 100644
--- a/python/pyarrow/flight.py
+++ b/python/pyarrow/flight.py
@@ -65,5 +65,6 @@ try:
     )
 except ImportError as exc:
     raise ImportError(
-        f"The pyarrow installation is not built with support for 'flight' ({str(exc)})"
+        "The 'pyarrow' installation is not built with support for " +
+        f"'flight'. Please install the conda-forge 'pyarrow-all' package. ({str(exc)})"
     ) from None

and users will see a message like:

The pyarrow installation is not built with support for 'flight'.  Please install the conda-forge 'pyarrow-all' package.

to be even more explicit.
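Downstream code that wants to degrade gracefully instead of failing at import time can probe for availability first. A minimal sketch (the helper name `has_module` is hypothetical, and stdlib modules stand in for pyarrow.flight here):

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be located without
    executing it (for dotted names, the parent gets imported)."""
    return importlib.util.find_spec(name) is not None


# Feature-gate optional functionality instead of crashing:
if has_module("json"):
    import json  # the optional dependency is present
```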

We can send an email to the mailing list to give a heads-up, but in my opinion we could merge. We did the split of libarrow and libarrow-all in the past and it didn't cause issues.

@ianmcook

ianmcook commented Apr 5, 2024

Thanks @raulcd — a message to the user@ and dev@ mailing lists would be great. The patch to give a more helpful error message would also be great.

Review comment on recipe/meta.yaml (outdated, resolved)
@raulcd
Member Author

raulcd commented Apr 23, 2024

@h-vetinari I was having a chat with @xhochy at PyCon DE and decided to add the Arrow update to v16.0.0 in the same PR, so there is no build without the split. I hope this is also fine with you.

@raulcd changed the title from Create subsets of pyarrow package with pyarrow-core < pyarrow < pyarrow-all to Create subsets of pyarrow package with pyarrow-core < pyarrow < pyarrow-all and update to Arrow v16.0.0 on Apr 23, 2024
@h-vetinari
Member

That's exactly how I would have suggested doing it! :)

@raulcd
Member Author

raulcd commented Apr 26, 2024

It seems we hit this issue: protocolbuffers/protobuf#14576. I'm trying the workaround proposed there by @h-vetinari.

Member

@xhochy left a comment

Let's go!

Member

@h-vetinari left a comment

This is getting close but still needs a bit of work. I also want to do #1375 first to uncouple the aws-migrations from the arrow bump. I'm happy to rebase this PR on top afterwards (commit history needs some cleanup as well), and apply the necessary clean-ups.

Also, this hasn't been rerendered yet for the new outputs, which is going to further increase the rerender time and effectively break the bot from opening new migrations (see conda-forge/conda-forge-pinning-feedstock#5815). This is not an issue with this PR itself, but something we should fix beforehand, or at the very least ASAP afterwards.

Review comments on recipe/meta.yaml (outdated, resolved)
@conda-forge-webservices

Hi! This is the friendly automated conda-forge-linting service.

I was trying to look for recipes to lint for you, but it appears we have a merge conflict.
Please try to merge or rebase with the base branch to resolve this conflict.

Please ping the 'conda-forge/core' team (using the @ notation in a comment) if you believe this is a bug.

@raulcd
Member Author

raulcd commented Apr 29, 2024

I completely messed up the rebase while cleaning up the commit history, which closed the PR. I've created a new PR here: #1376
Sorry about that :)

@ianmcook

@raulcd is pyarrow.compute included in pyarrow-core?

@h-vetinari
Member

@raulcd is pyarrow.compute included in pyarrow-core?

Mostly yes, see here:

    test:
      imports:
        - pyarrow
        # Compute can be imported but the underlying libarrow_acero is not present.
        - pyarrow.compute

I'm not sure how usable it is without acero; that's for @raulcd to say.

@jorisvandenbossche
Member

The compute kernels themselves (pyarrow.compute) should be fully usable without acero being present, AFAIK.

@jorisvandenbossche
Member

(We have an issue about splitting the compute kernels into their own shared library (apache/arrow#25025), which would allow them to be installed separately here as well; but until then I think the compute kernels are all included (or not, if one disables building them) in the main libarrow.so.)
