Create subsets of pyarrow package with pyarrow-core < pyarrow < pyarrow-all and update to Arrow v16.0.0 #1255

Closed

Conversation

raulcd
Member

@raulcd commented Nov 27, 2023

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

As discussed in #1201 this PR tries to add a new pyarrow-base that only depends on libarrow and libparquet.

@conda-forge-webservices

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@h-vetinari
Member

As discussed in #1201 this PR tries to add a new pyarrow-base that only depends on libarrow and libparquet.

I had thought the idea was to just let pyarrow itself depend on libarrow and libparquet, and provide clear instructions in the error path about how to install the other libraries if necessary (e.g. by installing libarrow-X or libarrow-all)?

@raulcd
Member Author

raulcd commented Nov 27, 2023

I had thought the idea was to just let pyarrow itself depend on libarrow and libparquet, and provide clear instructions in the error path about how to install the other libraries if necessary (e.g. by installing libarrow-X or libarrow-all)?

From the discussions and the conversations I've had with @jorisvandenbossche I thought we wanted to provide both pyarrow-base and pyarrow.

Currently, the error on pyarrow-base if we try to import, e.g., pyarrow.dataset is:

import: 'pyarrow.dataset'
Traceback (most recent call last):
  File "/home/conda/feedstock_root/build_artifacts/apache-arrow_1701073557860/test_tmp/run_test.py", line 5, in <module>
    import pyarrow.dataset
  File "/home/conda/feedstock_root/build_artifacts/apache-arrow_1701073557860/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.10/site-packages/pyarrow/dataset.py", line 58, in <module>
    raise ImportError(
ImportError: The pyarrow installation is not built with support for 'dataset' (libarrow_dataset.so.1400: cannot open shared object file: No such file or directory)
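The guard that produces this error can be sketched generically as follows (a simplified illustration of the pattern used in pyarrow/dataset.py; the helper name `import_optional` and the message wording are hypothetical, not pyarrow's actual API):

```python
import importlib


def import_optional(module_name, hint=""):
    """Import an optional submodule, re-raising with a descriptive
    error that names the missing feature and how to get it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"The installation is not built with support for "
            f"'{module_name}' ({exc}). {hint}"
        ) from None
```

With this pattern the low-level loader error (e.g. the missing libarrow_dataset.so) stays visible inside the message, while the chained traceback is suppressed via `from None`.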

I see your point and I am OK with both approaches; I would probably agree that 2 is the better approach going forward:

  1. If we provide both pyarrow-base and pyarrow, we are less disruptive for some of our current users who depend on "extra" libarrow packages.
  2. If we provide only pyarrow, depending only on libarrow and libparquet, we have a much cleaner installation going forward, at the cost of some disruption to current users.

@jorisvandenbossche what are your thoughts?

Review comments on recipe/build-pyarrow.bat and recipe/build-pyarrow.sh (outdated, resolved)
@jorisvandenbossche
Member

I haven't yet looked in detail, but some quick drive-by thoughts on the package naming:

  • If we want to make (long term, e.g. also when filesystems can be split off) the most minimal Arrow installation possible, shouldn't we also remove the libparquet dependency from the minimal package (whether it is pyarrow or pyarrow-base)?
  • If we have a general package that most users will install as the default, and where we want to include parquet, I would personally also include datasets (many features of pyarrow.parquet.read_table(..) actually rely on dataset, like reading a partitioned dataset).
  • I think one advantage of keeping a "pyarrow" package that is not the most minimal package is that this is less disruptive now (as mentioned above), but also later if we want to further remove things from "libarrow"/"pyarrow-base".
    For example, when splitting off the compute kernels, that would again be a breaking change if "pyarrow" then no longer provided them. Or, when splitting off the filesystems, that would break users assuming "pyarrow" comes with S3 support (although in that case there is something to be said for this not being part of base "pyarrow", because then why S3 and not GC or Azure? We might exactly want to avoid bundling them all in the default install, given you almost never need them all).

A potential alternative is something in between:

  • a pyarrow-base that is the absolute minimum (whatever minimum is possible at the moment), and people who rely on this package will more often have to update in case this minimum changes
  • a pyarrow package that includes some optional components to provide what we consider as a good "default" experience. It's of course debatable what this would be (and also this could change over time), but for example right now I would include parquet and datasets, but not gandiva and flight (because those are used much less, I think, and also have more complex dependencies). And for example when splitting off compute at some point, that could be something that keeps being included in this default pyarrow package.
  • when someone then wants the features of truly optional pieces like gandiva or flight (and potentially the filesystems in the future), at that point they definitely need to install the additional package explicitly.
    One question I am wondering here is whether we also want to provide the names "pyarrow-gandiva" and "pyarrow-flight" as aliases of "libarrow-gandiva" etc to make it more user friendly for python users.
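Such a Python-facing alias could be expressed as a thin metapackage output in the recipe. A sketch only (the output name and the pinning choice are assumptions, not the actual meta.yaml):

```yaml
outputs:
  - name: pyarrow-flight
    requirements:
      run:
        - {{ pin_subpackage("pyarrow-core", exact=True) }}
        - libarrow-flight
```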

@raulcd
Member Author

raulcd commented Nov 30, 2023

If we want to make (long term, eg also when filesystems can be split off) the most minimal arrow installation possible, shouldn't we also remove the libparquet dependency from the minimal package (whether it is pyarrow or pyarrow-base)?

Currently we can't import pyarrow without libparquet:

import: 'pyarrow'
Traceback (most recent call last):
  File "/home/conda/feedstock_root/build_artifacts/apache-arrow_1701339661261/test_tmp/run_test.py", line 2, in <module>
    import pyarrow
  File "/home/conda/feedstock_root/build_artifacts/apache-arrow_1701339661261/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.9/site-packages/pyarrow/__init__.py", line 65, in <module>
    import pyarrow.lib as _lib
ImportError: libparquet.so.1400: cannot open shared object file: No such file or directory

I've opened apache/arrow#39006

@h-vetinari
Member

some quick drive-by thoughts on the package naming:

I think some degree of breakage is unavoidable if we want to introduce a minimal pyarrow and default to it at some point. The question is whether the error messages that people encounter would be self-explanatory enough to trivially fix things.

A potential alternative is something in between:

If we really want not just pyarrow-base but also a pyarrow that doesn't contain absolutely everything, then we should have pyarrow-all (which would +/- match libarrow-all). I'm not 100% excited about this, because -base packages are often packaging work-arounds (like breaking a cyclic build dependence), and not meant to be installed directly.

It wouldn't be the first package with such a setup though: for example, ray has ray-core & ray-all as the smallest/largest collections, ray-default in between (ray was taken), and ray-* for various flavours. Similarly for modin, which has modin-core < modin < modin-all (plus some modin-*).

So I guess it'd be reasonable to do pyarrow-core < pyarrow < pyarrow-all. Not sure how you'll match that with the PyPI side, but from the POV of conda, we have a lot of design flexibility in this regard.

@ZupoLlask

Hi guys!

So how's this progressing? 👍

@raulcd
Member Author

raulcd commented Jan 15, 2024

So how's this progressing? 👍

I am currently working on releasing Arrow 15.0.0, which includes a fix that makes it possible to drop libparquet.so from the minimal pyarrow. Once Arrow v15 is released, I plan to rework the PR to generate pyarrow-core < pyarrow < pyarrow-all as discussed above. I can probably start working on this in the next couple of weeks.

@conda-forge-webservices

Hi! This is the friendly automated conda-forge-linting service.

I wanted to let you know that I linted all conda-recipes in your PR (recipe) and found some lint.

Here's what I've got...

For recipe:

  • Selectors are suggested to take a <two spaces>#<one space>[<expression>] form. See lines [709]

@conda-forge-webservices

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@raulcd
Member Author

raulcd commented Feb 1, 2024

@jorisvandenbossche @h-vetinari I have rebased this PR with 15.0.0, so libparquet is not required for pyarrow-core anymore and it now really depends only on libarrow.
At the moment I have created pyarrow-core < pyarrow < pyarrow-all.

  • pyarrow-core only contains libarrow as a run dependency.
  • pyarrow contains libarrow, libparquet, libarrow-acero, libarrow-dataset and libarrow-substrait (I am not entirely sure if we should also remove libarrow-substrait)
  • pyarrow-all contains, in addition to the pyarrow ones, libarrow-flight, libarrow-flight-sql and libgandiva as run dependencies.
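Expressed as recipe outputs, the layering described above would look roughly like this (an illustrative sketch only; the actual meta.yaml in this PR differs in detail):

```yaml
outputs:
  - name: pyarrow-core
    requirements:
      run:
        - libarrow
  - name: pyarrow
    requirements:
      run:
        - {{ pin_subpackage("pyarrow-core", exact=True) }}
        - libparquet
        - libarrow-acero
        - libarrow-dataset
        - libarrow-substrait
  - name: pyarrow-all
    requirements:
      run:
        - {{ pin_subpackage("pyarrow", exact=True) }}
        - libarrow-flight
        - libarrow-flight-sql
        - libgandiva
```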

I think this is good for a review at the moment.
Thanks!

@raulcd changed the title from Add pyarrow-base to Create subsets of pyarrow package with pyarrow-base < pyarrow < pyarrow-all on Feb 1, 2024
@raulcd changed the title from Create subsets of pyarrow package with pyarrow-base < pyarrow < pyarrow-all to Create subsets of pyarrow package with pyarrow-core < pyarrow < pyarrow-all on Apr 4, 2024
@ianmcook

ianmcook commented Apr 4, 2024

@jorisvandenbossche @raulcd I think we need to have a discussion about this on the Arrow mailing lists before we push this change. The Arrow community currently has little awareness that this change is planned. I worry that there could be much surprise and frustration if we do not inform the community in advance.

@raulcd
Member Author

raulcd commented Apr 5, 2024

@ianmcook I think this is a valid concern, but the only affected users will be those using flight, flight-sql or gandiva with pyarrow. When importing those modules they will get a message like:

The pyarrow installation is not built with support for 'flight'

We could add a patch like:

diff --git a/python/pyarrow/flight.py b/python/pyarrow/flight.py
index b183690..12eef59 100644
--- a/python/pyarrow/flight.py
+++ b/python/pyarrow/flight.py
@@ -65,5 +65,6 @@ try:
     )
 except ImportError as exc:
     raise ImportError(
-        f"The pyarrow installation is not built with support for 'flight' ({str(exc)})"
+        "The 'pyarrow' installation is not built with support for " +
+        f"'flight'. Please install the conda-forge 'pyarrow-all' package. ({str(exc)})"
     ) from None

and users will see a message like:

The pyarrow installation is not built with support for 'flight'.  Please install the conda-forge 'pyarrow-all' package.

to be even more explicit.
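Downstream code that wants to degrade gracefully instead of failing at import time can probe for availability first. A minimal sketch (the helper name `has_module` is hypothetical, and stdlib modules stand in for pyarrow.flight here):

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be located without
    executing it (for dotted names, the parent gets imported)."""
    return importlib.util.find_spec(name) is not None


# Feature-gate optional functionality instead of crashing:
if has_module("json"):
    import json  # the optional dependency is present
```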

We can send an email to the mailing list to give a heads-up, but in my opinion we could merge. We did the split of libarrow and libarrow-all in the past and it didn't cause issues.

@ianmcook

ianmcook commented Apr 5, 2024

Thanks @raulcd — a message to the user@ and dev@ mailing lists would be great. The patch to give a more helpful error message would also be great.

Review comment on recipe/meta.yaml (outdated, resolved)
@raulcd
Member Author

raulcd commented Apr 23, 2024

@h-vetinari I was having a chat with @xhochy at PyCon DE and decided to add the Arrow update to v16.0.0 in the same PR, so there is no build without the split. I hope this is also fine with you.

@raulcd changed the title from Create subsets of pyarrow package with pyarrow-core < pyarrow < pyarrow-all to Create subsets of pyarrow package with pyarrow-core < pyarrow < pyarrow-all and update to Arrow v16.0.0 on Apr 23, 2024
@h-vetinari
Member

That's exactly how I would have suggested doing it! :)

@raulcd
Member Author

raulcd commented Apr 26, 2024

It seems we hit this issue: protocolbuffers/protobuf#14576. I'm trying the workaround proposed there by @h-vetinari.

Member

@xhochy left a comment

Let's go!

Member

@h-vetinari left a comment

This is getting close but still needs a bit of work. I also want to do #1375 first to uncouple the aws-migrations from the arrow bump. I'm happy to rebase this PR on top afterwards (commit history needs some cleanup as well), and apply the necessary clean-ups.

Also, this hasn't been rerendered yet for the new outputs, which is going to further increase the rerender time and effectively break the bot from opening new migrations (see conda-forge/conda-forge-pinning-feedstock#5815). This is not an issue with this PR itself, but something we should fix beforehand, or at the very least ASAP afterwards.

Review comments on recipe/meta.yaml (outdated, resolved)
@conda-forge-webservices

Hi! This is the friendly automated conda-forge-linting service.

I was trying to look for recipes to lint for you, but it appears we have a merge conflict.
Please try to merge or rebase with the base branch to resolve this conflict.

Please ping the 'conda-forge/core' team (using the @ notation in a comment) if you believe this is a bug.

@raulcd
Member Author

raulcd commented Apr 29, 2024

I completely messed up the rebase while cleaning up the commit history, which closed the PR. I've created a new PR here: #1376
Sorry about that :)

@ianmcook

@raulcd is pyarrow.compute included in pyarrow-core?

@h-vetinari
Member

@raulcd is pyarrow.compute included in pyarrow-core?

Mostly yes, see here:

    test:
      imports:
        - pyarrow
        # Compute can be imported but the underlying libarrow_acero is not present.
        - pyarrow.compute

I'm not sure how usable it is without acero; that's for @raulcd to say.

@jorisvandenbossche
Member

The compute kernels themselves (pyarrow.compute) should be fully usable without acero being present, AFAIK.

@jorisvandenbossche
Member

(We have an issue about splitting the compute kernels into their own shared library (apache/arrow#25025), which would allow them to be installed separately here as well; but until then I think the compute kernels are all included (or not, if one disables building them) in the main libarrow.so.)
