Is it necessary to install cudatoolkit with pyarrow 11 on Linux? #962

Closed
hoxbro opened this issue Feb 9, 2023 · 11 comments
@hoxbro
hoxbro commented Feb 9, 2023

Comment:

When installing pyarrow 11 on Linux, cudatoolkit gets installed as well. It is a pretty big dependency:

mamba create -n tmp python=3.10 pyarrow=11 --dry-run --offline 2>&1 | grep cuda
  + cudatoolkit               11.8.0  h37601d7_11          conda-forge/linux-64      667MB

This is not downloaded with pyarrow=10.
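To spot heavy packages like this in a longer dry-run listing, the output can be filtered by size. The following is a minimal Python sketch, not a mamba feature: it assumes the six-column `+ name version build channel size` line layout seen in the outputs quoted in this thread, and it assumes decimal units (kB = 10^3 bytes, MB = 10^6 bytes). `SAMPLE`, `parse_size` and `large_packages` are hypothetical names introduced for illustration.

```python
import re

# Lines copied from dry-run / install outputs quoted in this thread.
SAMPLE = """\
  + cudatoolkit               11.8.0  h37601d7_11          conda-forge/linux-64      667MB
  + arrow-cpp                 11.0.0  ha770c72_0_cpu       conda-forge/linux-64       30kB
  + ucx                       1.13.1  h538f049_1           conda-forge/linux-64       17MB
  + ucx                       1.12.1  h7a399c7_1           conda-forge/linux-64     Cached
"""

# Assumption: sizes are reported in decimal units.
UNITS = {"B": 1, "kB": 10**3, "MB": 10**6, "GB": 10**9}


def parse_size(text):
    """Return the size in bytes, or None for 'Cached' or unrecognised text."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([kMG]?B)", text)
    if match is None:
        return None
    number, unit = match.groups()
    return int(float(number) * UNITS[unit])


def large_packages(output, threshold_bytes):
    """List (name, size_in_bytes) for packages at or above the threshold."""
    found = []
    for line in output.splitlines():
        fields = line.split()
        # Expected fields: "+", name, version, build, channel, size
        if len(fields) != 6 or fields[0] != "+":
            continue
        size = parse_size(fields[5])
        if size is not None and size >= threshold_bytes:
            found.append((fields[1], size))
    return found


# Report everything of 100 MB or more.
print(large_packages(SAMPLE, 100 * 10**6))
```

Packages shown as `Cached` carry no size in the listing, so this sketch skips them rather than guessing.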

When installing the environment I can see this is because of ucx:

mamba repoquery whoneeds -t cudatoolkit

cudatoolkit[11.8.0]
  └─ ucx[1.12.1]
     └─ libarrow[11.0.0]
        ├─ arrow-cpp[11.0.0]
        │  └─ parquet-cpp[1.5.1]
        │     └─ pyarrow[11.0.0]
        └─ pyarrow already visited
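
Read the other way around, the tree says that pyarrow reaches cudatoolkit through libarrow and ucx. A minimal Python sketch of that lookup, assuming only the edges visible in the tree above (real packages have many more dependencies); `DEPENDS` and `shortest_chain` are hypothetical names introduced for illustration:

```python
from collections import deque

# Forward dependency edges read off the `whoneeds` tree above.
DEPENDS = {
    "pyarrow": ["libarrow", "parquet-cpp"],
    "parquet-cpp": ["arrow-cpp"],
    "arrow-cpp": ["libarrow"],
    "libarrow": ["ucx"],
    "ucx": ["cudatoolkit"],
    "cudatoolkit": [],
}


def shortest_chain(graph, start, target):
    """Breadth-first search for the shortest dependency chain, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(path + [dep])
    return None


print(shortest_chain(DEPENDS, "pyarrow", "cudatoolkit"))

# Dropping the libarrow -> ucx edge removes every chain to cudatoolkit.
without_ucx = {**DEPENDS, "libarrow": []}
print(shortest_chain(without_ucx, "pyarrow", "cudatoolkit"))
```

In this small graph, every route from pyarrow to cudatoolkit passes through the single libarrow -> ucx edge.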
@hoxbro hoxbro added the question label Feb 9, 2023
@h-vetinari
Member

ucx indeed got added for arrow 11 (on the feedstock, haven't looked at backporting this yet), though I agree that cudatoolkit is a bit heavier than expected.

Are you using the CPU or CUDA-builds for arrow? I guess we could restrict it to the CUDA builds.

That said, there's a larger theme here that arrow keeps growing non-trivial dependencies. I guess we could introduce a separate output for a "minimal" arrow (libarrow-core?). Problem is that the layering is not obvious, i.e. a lot of things get compiled into the same shared library, and it's not clear that we could just separate things without building fully independent outputs.

CC @conda-forge/arrow-cpp

@jorisvandenbossche
Member

jorisvandenbossche commented Feb 16, 2023

I just ran into the same issue. cudatoolkit now gets installed for a normal CPU build (I was using a simple mamba install pyarrow on Ubuntu).

I guess we could restrict it to the CUDA builds.

Is it on the arrow-cpp side that you would not depend on UCX for CPU builds, or is it that ucx itself should not depend on cudatoolkit for CPU builds?

It seems that ucx already has some logic around this: https://github.com/conda-forge/ucx-split-feedstock/blob/dd05d5af3a6c4902093ecff980564b6605633124/recipe/meta.yaml#L59-L60 (maybe I am interpreting it wrongly, but it seems it should already not depend on cudatoolkit if cuda_compiler_version is "None")

That said, there's a larger theme here that arrow keeps growing non-trivial dependencies. I guess we could introduce a separate output for a "minimal" arrow (libarrow-core?).

That's indeed something we have to look at long term, but probably good for a separate issue? (it seems to me that regardless of that, cudatoolkit should never be a dependency for the CPU version?)
libarrow itself actually already consists of multiple shared libraries (libarrow, libarrow_dataset, libarrow_flight, etc.). Splitting along those lines into multiple packages could be a simpler first step (e.g. libarrow_flight has some extra dependencies, such as UCX, that are not needed for libarrow itself). On the Arrow C++ side itself, we are also working on further splitting the core libarrow into more shared libraries.

@jorisvandenbossche
Member

It seems that ucx already has some logic around this:

Digging a little bit further, it seems that it is the older versions of ucx that depend on cudatoolkit; the more recent versions indeed avoid it:

$ mamba repoquery depends ucx=1.12 -c conda-forge
 Name         Version Build       Channel             
───────────────────────────────────────────────────────
 ucx          1.12.1  h7507d65_0  conda-forge/linux-64
 libgcc-ng    9.5.0   hea2341a_17 conda-forge/linux-64
 libstdcxx-ng 9.5.0   hf86b28c_17 conda-forge/linux-64
 cudatoolkit  10.2.89 h713d32c_11 conda-forge/linux-64

$ mamba repoquery depends ucx -c conda-forge
 Name         Version Build       Channel             
───────────────────────────────────────────────────────
 ucx          1.13.1  h30ec399_0  conda-forge/linux-64
 libgcc-ng    12.2.0  h65d4601_19 conda-forge/linux-64
 libstdcxx-ng 12.2.0  h46fd767_19 conda-forge/linux-64
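
The comparison between the two tables can also be done mechanically. A minimal Python sketch, assuming the four-column layout shown above and assuming that the first data row is the queried package itself (as it is in both outputs); `parse_depends_table` and `depends_on` are hypothetical helpers, not part of mamba:

```python
# The two `mamba repoquery depends` outputs quoted above.
TABLE_1_12 = """\
 Name         Version Build       Channel
───────────────────────────────────────────────────────
 ucx          1.12.1  h7507d65_0  conda-forge/linux-64
 libgcc-ng    9.5.0   hea2341a_17 conda-forge/linux-64
 libstdcxx-ng 9.5.0   hf86b28c_17 conda-forge/linux-64
 cudatoolkit  10.2.89 h713d32c_11 conda-forge/linux-64
"""

TABLE_1_13 = """\
 Name         Version Build       Channel
───────────────────────────────────────────────────────
 ucx          1.13.1  h30ec399_0  conda-forge/linux-64
 libgcc-ng    12.2.0  h65d4601_19 conda-forge/linux-64
 libstdcxx-ng 12.2.0  h46fd767_19 conda-forge/linux-64
"""


def parse_depends_table(text):
    """Return (name, version, build, channel) tuples, skipping header and rule."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # The rule line splits into one field; the header starts with "Name".
        if len(fields) != 4 or fields[0] == "Name":
            continue
        rows.append(tuple(fields))
    return rows


def depends_on(text, dependency):
    """True if `dependency` appears among the rows after the queried package."""
    rows = parse_depends_table(text)
    return any(name == dependency for name, _version, _build, _channel in rows[1:])


print(depends_on(TABLE_1_12, "cudatoolkit"))
print(depends_on(TABLE_1_13, "cudatoolkit"))
```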

But a plain pyarrow install is, for some reason, not getting the latest ucx version by default:

$ mamba install pyarrow
...
  Package                    Version  Build                Channel                    Size
────────────────────────────────────────────────────────────────────────────────────────────
  Install:
────────────────────────────────────────────────────────────────────────────────────────────

  + arrow-cpp                 11.0.0  ha770c72_2_cpu       conda-forge/linux-64     Cached
  ...
  + cudatoolkit               11.8.0  h37601d7_11          conda-forge/linux-64     Cached
...
  + ucx                       1.12.1  h7a399c7_1           conda-forge/linux-64     Cached
...

But also asking for the latest version still gives a ucx build with cudatoolkit:

$ mamba install pyarrow ucx=1.13
...
  Package                    Version  Build                Channel                    Size
────────────────────────────────────────────────────────────────────────────────────────────
  Install:
────────────────────────────────────────────────────────────────────────────────────────────

  + arrow-cpp                 11.0.0  ha770c72_0_cpu       conda-forge/linux-64       30kB
 ...
  + cudatoolkit               11.8.0  h37601d7_11          conda-forge/linux-64     Cached
...
  + ucx                       1.13.1  h538f049_1           conda-forge/linux-64       17MB
...

Explicitly asking for the ucx build that doesn't depend on cudatoolkit avoids it:

$ mamba install pyarrow ucx=1.13.1=h30ec399_0
...
  Package                    Version  Build                Channel                    Size
────────────────────────────────────────────────────────────────────────────────────────────
  Install:
────────────────────────────────────────────────────────────────────────────────────────────

  + arrow-cpp                 11.0.0  ha770c72_0_cpu       conda-forge/linux-64       30kB
...
  + ucx                       1.13.1  h30ec399_0           conda-forge/linux-64       17MB
...

So it seems there is something wrong with the latest ucx build? (Did the cudatoolkit dependency get added again?) conda-forge/ucx-split-feedstock#114 is the PR that bumped the build number, but I don't directly see how that changed this dependency.
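
The difference between the specs used above can be illustrated with a small model. This is a simplified Python sketch, not conda's or mamba's actual matching code: it assumes the `name=version=build` spec form used in the commands above and treats a version as matching when it is equal to the requested one or extends it with a further component. `Spec`, `parse_spec` and `matches` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple, Optional


class Spec(NamedTuple):
    name: str
    version: Optional[str]
    build: Optional[str]


def parse_spec(text):
    """Split 'name', 'name=version' or 'name=version=build' into its parts."""
    parts = text.split("=")
    if not parts[0] or len(parts) > 3:
        raise ValueError(f"unsupported spec: {text!r}")
    version = parts[1] if len(parts) > 1 else None
    build = parts[2] if len(parts) > 2 else None
    return Spec(parts[0], version, build)


def matches(spec, name, version, build):
    """Simplified check of whether a concrete package satisfies the spec."""
    if spec.name != name:
        return False
    if spec.version is not None:
        # "1.13" accepts "1.13" and "1.13.1", but not "1.12.1".
        if version != spec.version and not version.startswith(spec.version + "."):
            return False
    if spec.build is not None and build != spec.build:
        return False
    return True


loose = parse_spec("ucx=1.13")
exact = parse_spec("ucx=1.13.1=h30ec399_0")

# Both 1.13.1 builds satisfy the loose spec, so the solver may pick either.
print(matches(loose, "ucx", "1.13.1", "h538f049_1"))
print(matches(loose, "ucx", "1.13.1", "h30ec399_0"))

# Only the build without cudatoolkit satisfies the exact spec.
print(matches(exact, "ucx", "1.13.1", "h538f049_1"))
print(matches(exact, "ucx", "1.13.1", "h30ec399_0"))
```

Under this model, `ucx=1.13` leaves the choice between the `h538f049_1` and `h30ec399_0` builds open, which is consistent with cudatoolkit still being pulled in until the build string is pinned.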

@jorisvandenbossche
Member

I opened conda-forge/ucx-split-feedstock#115 for the underlying issue with ucx (also installing ucx on its own has the same issue).

Until that is solved, temporarily removing the ucx dependency from libarrow might be the easiest workaround (since the dependency was only introduced recently, not too many people should rely on it being present).

@akrherz

akrherz commented Feb 16, 2023

Is the dependency issue here caused by re2, which prevents a newer libarrow + ucx (sans the cudatoolkit dependency) from being installed?

libarrow-11.0.0-hadd514c_2_cpu -> re2 >=2023.2.1,<2023.2.2.0a0

@h-vetinari
Member

It's very possible that ongoing migrations (e.g. libabseil) play a role in solvability issues. The re2 migration should be finished though, as in: everything should have been rebuilt for the newest version.

@jorisvandenbossche
Member

Is the dependency issue here with re2 that prevents newer libarrow + ucx (sans cudatoolkit dep) ?

The above might be the reason we currently get ucx 1.12, but newer versions of ucx have the issue as well: even when forcing it to be 1.13, cudatoolkit still gets installed (see one of the outputs in #962 (comment)).

@akrherz

akrherz commented Feb 16, 2023

I'm now at this point:

$ mamba create -n dev python=3.10 ucx=1.13.1=h538f049_1 libgoogle-cloud=2.7.0 pyarrow=11.0.0
- package pyarrow-11.0.0-py310hc81d9b2_0_cuda requires libarrow 11.0.0 h3793eca_0_cuda, but none of the providers can be installed
$ mamba create -n dev python=3.10 ucx=1.13.1 pyarrow=11.0.0
+ libgoogle-cloud            2.5.0  h21dfe5b_1 

So I'm currently suspecting libgoogle-cloud? But I trust @h-vetinari's assessment way more than mine!

Indeed: conda-forge/google-cloud-cpp-feedstock#126

@h-vetinari
Member

So I'm current suspecting libgoogle-cloud ?

This could be if any other package in the environment pins google-cloud-cpp, so it doesn't get recognised (and updated) by the migrator. Will have a look later.

Indeed: conda-forge/google-cloud-cpp-feedstock#126

That's not it, arrow hasn't been rebuilt for the newest abseil yet either.

@akrherz

akrherz commented Feb 19, 2023

FWIW, this issue is now gone for me with updated conda-forge builds no longer requiring ucx+cudatoolkit.

@h-vetinari h-vetinari changed the title from "Is it nessesary to install cudatoolkit with pyarrow 11 on Linux?" to "Is it necessary to install cudatoolkit with pyarrow 11 on Linux?" Feb 19, 2023
@h-vetinari
Member

Closing this issue now; please let us know if problems like this come up again. (We are planning to reinstate ucx support as soon as that feedstock has separated out the cudatoolkit dependency more cleanly.)
