
WIP: Initial split for libarrow components #1175

Closed
wants to merge 28 commits

Conversation

@raulcd (Member) commented Sep 15, 2023

Initial split into libarrow, libarrow-acero, libarrow-dataset, libarrow-flight, libarrow-flight-sql, libarrow-gandiva and libarrow-substrait, and use them in pyarrow.

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

This is still under development; the pending tasks are:

  • Clean up individual package recipes to possibly remove duplicated requirements
  • Investigate reuse of already compiled libarrow instead of recompilation per component
  • Investigate whether we should modify ArrowOptions.cmake per component
  • Create new pyarrow-base component
  • Do we require libarrow-all?

Closes #1035

@conda-forge-webservices

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

@h-vetinari (Member) left a comment:

First look - thanks for tackling this!

Resolved (outdated) review threads on recipe/build-arrow.bat (×2) and recipe/build-arrow.sh.
recipe/meta.yaml (outdated diff), comment on lines 244 to 1017:
- {{ pin_subpackage('libarrow', exact=True) }}
- {{ pin_subpackage('libarrow-dataset', exact=True) }}
- {{ pin_subpackage('libarrow-flight', exact=True) }}
- {{ pin_subpackage('libarrow-gandiva', exact=True) }}
- {{ pin_subpackage('libarrow-substrait', exact=True) }}
@h-vetinari (Member) commented:

I guess there's a question hiding in there whether we want to provide a libarrow-all or something like that, which is just a metapackage that pulls together all the lib* components, and then pyarrow would host-depend on that.
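
For illustration, such a metapackage could look roughly like the following in meta.yaml (a sketch only: the output names follow this discussion, the pins mirror the ones quoted above, and none of it is the final recipe):

      - name: libarrow-all
        # metapackage: ships no files of its own, it only pulls in the lib* components
        requirements:
          run:
            - {{ pin_subpackage('libarrow', exact=True) }}
            - {{ pin_subpackage('libarrow-acero', exact=True) }}
            - {{ pin_subpackage('libarrow-dataset', exact=True) }}
            - {{ pin_subpackage('libarrow-flight', exact=True) }}
            - {{ pin_subpackage('libarrow-flight-sql', exact=True) }}
            - {{ pin_subpackage('libarrow-gandiva', exact=True) }}
            - {{ pin_subpackage('libarrow-substrait', exact=True) }}

pyarrow could then host-depend on libarrow-all instead of listing every component individually.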

@raulcd (Member, Author) commented Sep 16, 2023

@kou @jorisvandenbossche do you have any idea why only some of the registered functions for compute like array_filter would be available but some like equal are missing?
The error I am trying to solve:

ImportError while loading conftest '/home/conda/feedstock_root/build_artifacts/apache-arrow_1694796036002/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.8/site-packages/pyarrow/conftest.py'.
pyarrow/conftest.py:92: in <module>
    import fastparquet  # noqa
fastparquet/__init__.py:4: in <module>
    from .writer import write, update_file_custom_metadata
fastparquet/writer.py:9: in <module>
    import pandas as pd
pandas/__init__.py:48: in <module>
    from pandas.core.api import (
pandas/core/api.py:27: in <module>
    from pandas.core.arrays import Categorical
pandas/core/arrays/__init__.py:1: in <module>
    from pandas.core.arrays.arrow import ArrowExtensionArray
pandas/core/arrays/arrow/__init__.py:1: in <module>
    from pandas.core.arrays.arrow.array import ArrowExtensionArray
pandas/core/arrays/arrow/array.py:83: in <module>
    "eq": pc.equal,
E   AttributeError: module 'pyarrow.compute' has no attribute 'equal'

and I can see:

~/code/arrow-cpp-feedstock2/build_artifacts/apache-arrow_1694796036002/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla 😀 (split-libarrow-components)  $ bin/python
Python 3.8.17 | packaged by conda-forge | (default, Jun 16 2023, 07:06:00) 
[GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow
>>> pyarrow.__path__
['/home/raulcd/code/arrow-cpp-feedstock2/build_artifacts/apache-arrow_1694796036002/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.8/site-packages/pyarrow']
>>> from pyarrow import compute
>>> dir(compute)
['ArraySortOptions', 'AssumeTimezoneOptions', 'CastOptions', 'CountOptions', 'CumulativeOptions', 'CumulativeSumOptions', 'DayOfWeekOptions', 'DictionaryEncodeOptions', 'ElementWiseAggregateOptions', 'Expression', 'ExtractRegexOptions', 'FilterOptions', 'Function', 'FunctionOptions', 'FunctionRegistry', 'HashAggregateFunction', 'HashAggregateKernel', 'IndexOptions', 'JoinOptions', 'Kernel', 'ListSliceOptions', 'MakeStructOptions', 'MapLookupOptions', 'MatchSubstringOptions', 'ModeOptions', 'NullOptions', 'PadOptions', 'PairwiseOptions', 'PartitionNthOptions', 'QuantileOptions', 'RandomOptions', 'RankOptions', 'ReplaceSliceOptions', 'ReplaceSubstringOptions', 'RoundBinaryOptions', 'RoundOptions', 'RoundTemporalOptions', 'RoundToMultipleOptions', 'RunEndEncodeOptions', 'ScalarAggregateFunction', 'ScalarAggregateKernel', 'ScalarAggregateOptions', 'ScalarFunction', 'ScalarKernel', 'SelectKOptions', 'SetLookupOptions', 'SliceOptions', 'SortOptions', 'SplitOptions', 'SplitPatternOptions', 'StrftimeOptions', 'StrptimeOptions', 'StructFieldOptions', 'TDigestOptions', 'TakeOptions', 'TrimOptions', 'UdfContext', 'Utf8NormalizeOptions', 'VarianceOptions', 'VectorFunction', 'VectorKernel', 'WeekOptions', '_OptionsClassDoc', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', '_compute_docstrings', '_decorate_compute_function', '_get_arg_names', '_get_options_class', '_handle_options', '_make_generic_wrapper', '_make_global_functions', '_make_signature', '_scrape_options_class_doc', '_wrap_function', 'array_filter', 'array_take', 'bottom_k_unstable', 'call_function', 'call_tabular_function', 'cast', 'dedent', 'dictionary_encode', 'docscrape', 'drop_null', 'field', 'fill_null', 'filter', 'function_registry', 'get_function', 'index', 'indices_nonzero', 'inspect', 'list_functions', 'namedtuple', 'pa', 'random', 'register_aggregate_function', 'register_scalar_function', 'register_tabular_function', 'scalar', 'take', 'top_k_unstable', 'unique', 'value_counts', 'warnings']
>>> compute.equal
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'pyarrow.compute' has no attribute 'equal'

but some of the registered functions via _make_global_functions() are present:

>>> compute.array_filter()          
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/raulcd/code/arrow-cpp-feedstock2/build_artifacts/apache-arrow_1694796036002/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_pla/lib/python3.8/site-packages/pyarrow/compute.py", line 250, in wrapper
    raise TypeError(
TypeError: array_filter takes 2 positional argument(s), but 0 were given

In case you are curious, the CI jobs I am trying to fix first are the Linux 64 ones:
https://dev.azure.com/conda-forge/feedstock-builds/_build/results?buildId=781413&view=logs&j=a4934dd9-cac8-50ae-1ee1-0f1db65c0032&t=b14c6e0e-8788-5a67-434e-f6c2c7a624a8&l=31262

edit: added link to CI

@raulcd (Member, Author) commented Sep 18, 2023

do you have any idea why only some of the registered functions for compute like array_filter would be available but some like equal are missing?

The issue seems to be that some basic kernels do not require ARROW_COMPUTE=ON:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/CMakeLists.txt#L404
while other kernels do require ARROW_COMPUTE:
https://github.com/apache/arrow/blob/main/cpp/src/arrow/CMakeLists.txt#L450
I've been able to solve the issue using ARROW_COMPUTE=ON; I am asking on Zulip to understand why some of those kernels, like scalar_compare, are not treated as "basic".

@h-vetinari (Member) left a comment:

Another quick look: I don't think we need the full current test suite (for libarrow) duplicated to all outputs; we can probably cut that down a tad.

Resolved (outdated) review thread on recipe/build-arrow.sh.
Comment on lines +418 to +459
{% set libs = (cuda_compiler_version != "None") * ["arrow_cuda"] + [
"arrow", "arrow_acero"
] %}
@h-vetinari (Member) commented:

What's the situation around libarrow_cuda.so? It's currently listed as an expected library in a bunch of outputs; if indeed those bits get mashed together in a single library, then these outputs would clobber each other...

@raulcd (Member, Author) replied:

Isn't this just because I am using libarrow as a run dependency? I can remove libarrow from the run dependencies, but as libarrow also contains all headers at the moment, I should probably stop testing for headers in that case.

@jorisvandenbossche (Member) commented:

Potentially dumb question: could we "just" build Arrow C++ once with all components enabled, and then have a way to have each actual output package ship one of the shared libraries that was created with the initial build?

@h-vetinari (Member) commented:

could we "just" build Arrow C++ once with all components enabled, and then have a way to have each actual output package ship one of the shared libraries

Yes, that works from a package-building POV, and it is actually what we should do, because we don't have the (CI) time to rebuild everything X times

@raulcd (Member, Author) commented Sep 19, 2023

could we "just" build Arrow C++ once with all components enabled, and then have a way to have each actual output package ship one of the shared libraries

Yes, that works from a package-building POV, and it is actually what we should do, because we don't have the (CI) time to rebuild everything X times

I was planning on investigating why / how to reuse the previously built components so we don't rebuild again from scratch, but I can investigate how this "selection" of which shared libraries to ship per package can be done in the conda recipe. @h-vetinari do you know of any other project that might do something similar where I could take inspiration? Otherwise I'll investigate how I can just pick the specific .so + headers for the specific package. It's my first time working on a conda recipe :)

@h-vetinari (Member) commented:

@h-vetinari do you know any other project that might do something similar where I could take inspiration?

I think the best pattern for this (or at least: my personal favourite) is the one with installing into a temporary prefix, and then copying the required bits per output. See for example llvmdev.

The structure would then be:

  • build.sh / bld.bat: "global" build phase that builds libarrow with all the bells and whistles
  • install-libarrow.{sh,bat}: script that would ensure distributing the libs into the respective outputs
  • build-pyarrow.{sh,bat}: build pyarrow on top of libarrow (as now)

If copying the headers becomes a hassle, we can also consider having a libarrow-devel output that collects all the dependencies on the libarrow-* outputs, plus contains the headers. The advantage of this would be that we wouldn't be shipping lots of headers to endusers that only use the shared libs anyway. See also conda-forge/boost-feedstock#164 (bit of a sprawling PR, but for boost it's a real issue with 15k+ headers, so we don't want to ship them where not necessary)
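
To make that structure concrete, here is a rough sketch of how the outputs section could dispatch to per-output install scripts (the script names follow the list above; which files each script copies from the temporary prefix is an assumption, not the eventual recipe):

      # build.sh / bld.bat: single "global" build of libarrow into a temporary prefix
      outputs:
        - name: libarrow
          script: install-libarrow.sh     # copies libarrow.so.* (and, for now, the headers) into $PREFIX
        - name: libarrow-dataset
          script: install-libarrow.sh     # same script; an env var set per output could select which libs to copy
        # ... one output per component ...
        - name: pyarrow
          script: build-pyarrow.sh        # builds pyarrow on top of the libarrow-* outputs, as now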

@conda-forge-webservices

Hi! This is the friendly automated conda-forge-linting service.

I just wanted to let you know that I linted all conda-recipes in your PR (recipe) and found it was in an excellent condition.

I do have some suggestions for making it better though...

For recipe:

  • It looks like the 'libarrow-all' output doesn't have any tests.
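
Assuming libarrow-all stays a pure metapackage, one way to address that lint would be a minimal commands test on the output that only checks that the pulled-in libraries exist (paths and selectors here are illustrative, not taken from the recipe):

      - name: libarrow-all
        test:
          commands:
            - test -f $PREFIX/lib/libarrow.so                # [linux]
            - test -f $PREFIX/lib/libarrow_dataset.so        # [linux]
            - test -f $PREFIX/lib/libarrow.dylib             # [osx]
            - if not exist %LIBRARY_BIN%\arrow.dll exit 1    # [win]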

@h-vetinari (Member) left a comment:

Please reduce the host dependencies per output to the necessary minimum. The link check helps you with this. For example:

WARNING (libarrow-dataset): run-exports library package conda-forge::aws-crt-cpp-0.23.1-hf7d0843_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::aws-sdk-cpp-1.11.156-he6c2984_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::bzip2-1.0.8-h7f98852_4 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::glog-0.6.0-h6f12383_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libabseil-20230802.1-cxx17_h59595ed_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libbrotlidec-1.1.0-hd590300_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libbrotlienc-1.1.0-hd590300_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libgoogle-cloud-2.12.0-h8d7e28b_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libthrift-0.19.0-h8fd135c_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libutf8proc-2.8.0-h166bdaf_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::libzlib-1.2.13-hd590300_5 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::lz4-c-1.9.3-h9c3ff4c_1 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::orc-1.9.0-h52d3b3c_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::re2-2023.03.02-h8c504da_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::snappy-1.1.10-h9fff704_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::ucx-1.14.0-h3484d09_2 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)
WARNING (libarrow-dataset): run-exports library package conda-forge::zstd-1.5.5-hfc55251_0 in requirements/run but it is not used (i.e. it is overdepending or perhaps statically linked? If that is what you want then add it to `build/ignore_run_exports`)

means that libarrow-dataset has a large amount of unnecessary host dependencies.

This cuts both ways, though; for example, libparquet also depends on openssl (presumably for the encryption facilities):

WARNING (libarrow,lib/libparquet.so.1300.0.0): Needed DSO lib/libcrypto.so.3 found in ['openssl']
WARNING (libarrow,lib/libparquet.so.1300.0.0): .. but ['openssl'] not in reqs/run, (i.e. it is overlinking) (likely) or a missing dependency (less likely)

(note that - despite the error message talking about reqs/run - this is still a question about host-dependencies, because if openssl is a host-dependency, the respective run-export will ensure it becomes a run-dependency as well).
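
To make both directions concrete, here is a sketch of the two kinds of fixes (the packages named are just the ones from the warnings above; the real list has to come from the link check output, not from this example):

      - name: libarrow-dataset
        requirements:
          host:
            # drop the host dependencies whose run-exports the link check flags as unused
            # (or, if something is intentionally statically linked, list it under build/ignore_run_exports)
            - {{ pin_subpackage('libarrow', exact=True) }}

      - name: libparquet
        requirements:
          host:
            - openssl        # host dependency so that openssl's run-export turns into a run dependency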

Resolved (outdated) review thread on recipe/meta.yaml.
- xsimd
- zlib
- zstd
- __cuda >=11.2 # [cuda_compiler_version != "None"]
@raulcd (Member, Author) commented Sep 29, 2023:

@h-vetinari this was raising an error locally if I used __cuda >={{ cuda_compiler_version_min }} # [cuda_compiler_version != "None"] here. The error when running python build-locally.py linux_64_cuda_compiler_versionNone:

RuntimeError: Received dictionary as spec.  Note that pip requirements are not supported in conda-build meta.yaml.  Error message: Invalid version '>=': invalid operator

Do you know if there is any issue where the global host dependency syntax would behave differently with this templating syntax?

edit: syntax
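
One possible workaround, assuming the failure comes from {{ cuda_compiler_version_min }} rendering as an empty string on the non-CUDA variant (an assumption, not something confirmed in this thread), would be to give the variable a Jinja default so the version spec can never end up as a bare ">=":

      - __cuda >={{ cuda_compiler_version_min | default("11.2") }}    # [cuda_compiler_version != "None"]

Whether that matches the actual cause of the error above is unverified.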

@h-vetinari (Member) commented:

The dependencies are beginning to look better and better 👍

substrait is still missing a libprotobuf dependency, and pyarrow needs more (all?) of the libarrow subpackages. For finding these, I recommend opening the "raw log" in azure and then searching for WARNING (, which works on all platforms.
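
For the substrait part, the fix would presumably be to add libprotobuf to that output's host requirements so its run-export gets picked up (a sketch; the rest of the requirements are omitted):

      - name: libarrow-substrait
        requirements:
          host:
            - libprotobuf
            - {{ pin_subpackage('libarrow', exact=True) }}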

@raulcd (Member, Author) commented Oct 3, 2023

@h-vetinari I think I've fixed the different ones, but I am not sure why pyarrow fails to find the different libarrow-related DSOs; for example:

2023-10-03T11:36:33.1805882Z WARNING (pyarrow,lib/python3.9/site-packages/pyarrow/_dataset_orc.cpython-39-aarch64-linux-gnu.so): Needed DSO lib/libarrow_dataset.so.1300 found in ['libarrow-dataset']
2023-10-03T11:36:33.1816964Z WARNING (pyarrow,lib/python3.9/site-packages/pyarrow/_dataset_orc.cpython-39-aarch64-linux-gnu.so): .. but ['libarrow-dataset'] not in reqs/run, (i.e. it is overlinking) (likely) or a missing dependency (less likely)

Shouldn't the libarrow-all dependency cover those? I might be misunderstanding how this works, because I thought that with libarrow-all having:

      run:
        - {{ pin_subpackage('libarrow', exact=True) }}
        - {{ pin_subpackage('libparquet', exact=True) }}
        - {{ pin_subpackage('libarrow-acero', exact=True) }}
        - {{ pin_subpackage('libarrow-dataset', exact=True) }}
        - {{ pin_subpackage('libarrow-gandiva', exact=True) }}
        - {{ pin_subpackage('libarrow-substrait', exact=True) }}
        - {{ pin_subpackage('libarrow-flight', exact=True) }}
        - {{ pin_subpackage('libarrow-flight-sql', exact=True) }}

and pyarrow having both:

      host:
        - {{ pin_subpackage('libarrow-all', exact=True) }}

and

      run:
        - {{ pin_subpackage('libarrow-all', exact=True) }}

pyarrow should contain the DSOs.

@h-vetinari (Member) commented Oct 3, 2023

pyarrow having both [...] and [...] should contain the DSO's.

It's enough to have the DSOs (and so the package will be functional), but the metadata is incomplete, because pyarrow does not run-depend on the outputs that actually contain the DSOs.

In this case, it wouldn't be an actual issue because pyarrow depends exactly on libarrow-all, which depends exactly on the right libs for the same build, but conda cannot assume that such transitive dependencies are actually ABI-safe.

In any case, the solution is simple: we need to add run-exports for all the sub-libraries to libarrow-all. This is necessary anyway for correctness for anyone who chooses to compile against libarrow-all.

In turn, you'll be able to remove the run-dependence on libarrow-all.
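
A sketch of what that could look like on the libarrow-all output (the exact pins mirror the run section quoted above; whether the final recipe uses exact or looser pins in run_exports is an open detail):

      - name: libarrow-all
        build:
          run_exports:
            - {{ pin_subpackage('libarrow', exact=True) }}
            - {{ pin_subpackage('libparquet', exact=True) }}
            - {{ pin_subpackage('libarrow-acero', exact=True) }}
            - {{ pin_subpackage('libarrow-dataset', exact=True) }}
            - {{ pin_subpackage('libarrow-gandiva', exact=True) }}
            - {{ pin_subpackage('libarrow-substrait', exact=True) }}
            - {{ pin_subpackage('libarrow-flight', exact=True) }}
            - {{ pin_subpackage('libarrow-flight-sql', exact=True) }}

With this in place, anything that lists libarrow-all in host (including pyarrow) automatically gets run dependencies on the individual libraries, which is what the link check is asking for.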

@h-vetinari (Member) commented:

@raulcd, I took your changes here and rebased them on main, plus slimmed down the refactor where possible: #1201

This should more-or-less be ready to go for arrow 14 (should be in the final rounds of polishing). However, I noticed one crucial thing w.r.t. #1035 - the new libarrow core library still depends on some of the most heavy-weight libraries at runtime (e.g. libgoogle-cloud, which is around 30MB). I think it would make sense to separate out the pieces that depend on cloud-provider bindings into a separate output. Not sure how much work that is...

@raulcd (Member, Author) commented Oct 16, 2023

Thanks @h-vetinari ! And many thanks for your support when I was working on it!!

I know there are some conversations about creating a libarrow-fs in order to split the filesystems away from Arrow core, which would allow us to remove aws-sdk-cpp and libgoogle-cloud from libarrow. So, as far as I understand, this is something that is on the "roadmap", but I am not sure when it will be done. Maybe @jorisvandenbossche and @zeroshade have more of an idea, as they have discussed this in the past.

@jorisvandenbossche (Member) commented:

We have indeed chatted about that privately (and exactly with the motivation to be able to remove the large aws and libgoogle-cloud dependencies from the core libarrow package here). I opened apache/arrow#38309 to have a public record of this idea.

Development: successfully merging this pull request may close the issue "Package some form of 'minimal' libarrow?" (#1035).