build(cudf): Simplify cuDF build configuration #11407

Closed
bdice wants to merge 5 commits into apache:main from bdice:fix-cudf
Conversation

@bdice bdice commented Jan 13, 2026

What changes are proposed in this pull request?

This is a follow-up to my comments on #11275.

This changeset should make it simpler to build with cuDF support.

  • Use CCCL from cudf
    • We should use cuDF's CCCL (fetched from GitHub) instead of finding it from the CUDA Toolkit. This means we can build with any supported CUDA Toolkit version. This is a hard requirement for compatibility with some RAPIDS versions like the upcoming 26.02 release, which uses a CCCL version that isn't shipped in a CUDA Toolkit yet.
  • Fix CMakeLists to use find_package(cudf)
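To illustrate the second bullet, a minimal sketch of what a consumer's CMake configuration might look like once find_package(cudf) works. This is not the PR's actual CMakeLists; the project and source names are hypothetical, and it assumes cuDF is installed (or cudf_DIR points at a build tree):

```cmake
# Hedged sketch: locate cuDF via its package config and link its imported
# target. cuDF's config brings along its own dependencies (rmm, CCCL), so
# nothing needs to be found from the CUDA Toolkit's shipped CCCL.
cmake_minimum_required(VERSION 3.26)
project(gluten_cudf_example LANGUAGES CXX CUDA)

# Honors -Dcudf_DIR=/path/to/cudf/lib/cmake/cudf if cuDF is not installed
# in a default prefix.
find_package(cudf REQUIRED)

# example_main.cpp is a placeholder source name, not from the PR.
add_executable(example_main example_main.cpp)
target_link_libraries(example_main PRIVATE cudf::cudf)
```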

How was this patch tested?

I built this locally in a container.

Author

@bdice bdice left a comment
@zhouyuan Could you trigger CI on this PR? I did some local testing but I'd like to confirm that it's working in CI.

ls -l /usr/local/
source /opt/rh/gcc-toolset-12/enable

source /opt/rh/gcc-toolset-14/enable
Author

@bdice bdice Jan 13, 2026

We do need GCC 14, but we could remove the extra steps above from #11275 that change the CUDA version if you wish. This PR should make it work with the CUDA 12 version that already exists in the container. I know there were quite a few workarounds to reduce the disk space to make room for CUDA 13.1 -- we could revert that too.

Author

If you'd like me to help revert those changes and minimize the build scripts, I can do that. Let me know your thoughts.

Contributor

This line enables GCC 14, though I don't know why we cannot source /opt/rh/gcc-toolset-14/enable directly. Could you check whether building from this Dockerfile works? https://github.com/apache/incubator-gluten/blob/main/dev/docker/cudf/Dockerfile. I hit a curl version issue with it before.

Contributor

It's OK for me to use CUDA 13.1. I have resolved all the version mismatch issues, but I hit a new issue with the newest Velox; I will try to fix it:

26/01/13 10:19:08 ERROR Executor: Exception in task 7.0 in stage 40.0 (TID 12116)
org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (1 vs. 0) Leaf child memory pool cudf-expr-precompile already exists in __sys_root__
Retriable: False
Expression: children_.count(name) == 0
Function: addLeafChild
File: /opt/gluten/ep/build-velox/build/velox_ep/velox/common/memory/MemoryPool.cpp
Line: 331
Stack trace:

@bdice bdice marked this pull request as ready for review January 13, 2026 15:43
@jinchengchenghh
Contributor

Some of these changes duplicate #11386; it's OK with me to merge either of them.

@@ -31,7 +31,7 @@ WORKDIR /opt/gluten
RUN rm -rf /opt/rh/gcc-toolset-12 && ln -s /opt/rh/gcc-toolset-14 /opt/rh/gcc-toolset-12; \
Contributor

Did you try to create a Docker image from this Dockerfile? I hit a curl version issue before; please help verify whether this PR resolves it.

Contributor

@jinchengchenghh jinchengchenghh Jan 19, 2026

It cannot run successfully:

692.8 -- [CURL] Enabled SSL backends: OpenSSL
692.8 -- Setting DuckDB source to AUTO
692.8 -- [DuckDB] Using SYSTEM DuckDB
692.8 -- Using ccache: /usr/bin/ccache
692.8 -- The CUDA compiler identification is unknown
692.8 -- Configuring incomplete, errors occurred!
692.8 make[1]: Leaving directory '/opt/gluten/ep/build-velox/build/velox_ep'
Dockerfile:31
--------------------
  30 |     WORKDIR /opt/gluten
  31 | >>> RUN rm -rf /opt/rh/gcc-toolset-12 && ln -s /opt/rh/gcc-toolset-14 /opt/rh/gcc-toolset-12; \
  32 | >>>     dnf remove -y cuda-toolkit-12* && dnf install -y cuda-toolkit-13-1; \
  33 | >>>     dnf autoremove -y && dnf clean all; \
  34 | >>>     source /opt/rh/gcc-toolset-14/enable; \
  35 | >>>     bash ./dev/buildbundle-veloxbe.sh --run_setup_script=OFF --build_arrow=ON --spark_version=3.4 --build_tests=ON --build_benchmarks=ON --enable_gpu=ON && rm -rf /opt/gluten
  36 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c rm -rf /opt/rh/gcc-toolset-12 && ln -s /opt/rh/gcc-toolset-14 /opt/rh/gcc-toolset-12;     dnf remove -y cuda-toolkit-12* && dnf install -y cuda-toolkit-13-1;     dnf autoremove -y && dnf clean all;     source /opt/rh/gcc-toolset-14/enable;     bash ./dev/buildbundle-veloxbe.sh --run_setup_script=OFF --build_arrow=ON --spark_version=3.4 --build_tests=ON --build_benchmarks=ON --enable_gpu=ON && rm -rf /opt/gluten" did not complete successfully: exit code: 2

@jinchengchenghh
Contributor

This pipeline failed even though the final result reported success: https://github.com/apache/incubator-gluten/actions/runs/20962505327/job/60252099418?pr=11407

In file included from /work/dev/../ep/build-velox/build/velox_ep/_build/release/_deps/cudf-src/cpp/include/cudf/utilities/default_stream.hpp:10,
                 from /work/dev/../ep/build-velox/build/velox_ep/_build/release/_deps/cudf-src/cpp/include/cudf/column/column_view.hpp:8,
                 from /work/dev/../ep/build-velox/build/velox_ep/_build/release/_deps/cudf-src/cpp/include/cudf/column/column.hpp:7,
                 from /work/dev/../ep/build-velox/build/velox_ep/_build/release/_deps/cudf-src/cpp/include/cudf/table/table.hpp:7,
                 from /work/dev/../ep/build-velox/build/velox_ep/velox/experimental/cudf/exec/VeloxCudfInterop.h:22,
                 from /work/cpp/velox/tests/VeloxGpuShuffleWriterTest.cc:32:
/work/dev/../ep/build-velox/build/velox_ep/_build/release/_deps/rmm-src/cpp/include/rmm/cuda_stream_view.hpp:10:10: fatal error: cuda/stream_ref: No such file or directory
   10 | #include <cuda/stream_ref>
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.

dnf remove -y cuda-toolkit-12* && dnf install -y cuda-toolkit-13-1; \
dnf autoremove -y && dnf clean all; \
source /opt/rh/gcc-toolset-12/enable; \
source /opt/rh/gcc-toolset-14/enable; \
Contributor

Is it because we should not source GCC 14?

CMake Error at CMakeLists.txt:476 (enable_language):
  The CMAKE_CUDA_COMPILER:

    /usr/local/cuda-12.8/bin/nvcc

  is not a full path to an existing compiler tool.

  Tell CMake where to find the compiler by setting either the environment
  variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full
  path to the compiler, or to the compiler name if it is in the PATH.
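This error comes from hard-coding a versioned nvcc path. As a hedged sketch (not from the PR itself), a build script can instead resolve nvcc from the unversioned /usr/local/cuda symlink, honoring any CUDACXX already set in the environment:

```shell
# Hypothetical helper: resolve the CUDA compiler without pinning a toolkit
# version. CMake reads the CUDACXX environment variable directly, so exporting
# it avoids passing -DCMAKE_CUDA_COMPILER with a hard-coded version.
CUDACXX="${CUDACXX:-/usr/local/cuda/bin/nvcc}"
export CUDACXX
echo "Using CUDA compiler: ${CUDACXX}"
```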

@jinchengchenghh
Contributor

Please help update:

diff --git a/ep/build-velox/src/build-velox.sh b/ep/build-velox/src/build-velox.sh
index 3e0f6be..ac62f8f 100755
--- a/ep/build-velox/src/build-velox.sh
+++ b/ep/build-velox/src/build-velox.sh
@@ -134,8 +134,8 @@ function compile {
   if [ $ENABLE_GPU == "ON" ]; then
     # the cuda default options are for Centos9 image from Meta
     echo "enable GPU support."
-    COMPILE_OPTION="$COMPILE_OPTION -DVELOX_ENABLE_GPU=ON -DVELOX_ENABLE_CUDF=ON -DCMAKE_CUDA_ARCHITECTURES=70 \
-        -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc"
+    COMPILE_OPTION="$COMPILE_OPTION -DVELOX_ENABLE_GPU=ON -DVELOX_ENABLE_CUDF=ON -DCMAKE_CUDA_ARCHITECTURES=75 \
+        -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc"
   fi
   if [ -n "${GLUTEN_VCPKG_ENABLED:-}" ]; then
     COMPILE_OPTION="$COMPILE_OPTION -DVELOX_GFLAGS_TYPE=static"
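A hedged alternative to hard-coding 70 or 75 (hypothetical, not part of the requested change): derive the architecture from the local GPU via nvidia-smi, falling back to 75 on machines without a visible GPU:

```shell
# Query the GPU's compute capability (e.g. "7.5") and strip the dot to get a
# CMAKE_CUDA_ARCHITECTURES value; fall back to 75 when nvidia-smi is absent
# or no GPU is visible (e.g. on CPU-only build machines).
arch="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n1 | tr -d '.')"
arch="${arch:-75}"
echo "-DCMAKE_CUDA_ARCHITECTURES=${arch}"
```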

@jinchengchenghh
Contributor

Do you know why the CI succeeds but build failed? #11407 (comment) @PHILO-HE

@jinchengchenghh
Contributor

Thanks for your fix. PR #11386 was ahead of yours and has been merged, and its CI passed. If you think using cudf_DIR is more reasonable, please fix the CI and verify that the build with the Dockerfile succeeds.

@PHILO-HE
Member

Do you know why the CI succeeds but build failed? #11407 (comment) @PHILO-HE

@jinchengchenghh, sorry for missing this comment. set -e must be explicitly added when passing commands via docker run bash -c so that the job stops if any command fails, i.e., docker run bash -c "set -e; xxx". Alternatively, I would recommend using the standard GitHub Actions container field with apache/gluten:centos-9-jdk8-cudf set. In an existing job, we use docker run bash -c for CentOS 7 because that image is incompatible with the GHA checkout action. Since it's CentOS 9 here, I assume the standard way should work.
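The behavior @PHILO-HE describes can be demonstrated with a tiny generic-shell sketch (not the actual CI script):

```shell
# Without `set -e`, a failed command inside `sh -c "..."` does not abort the
# script, so the overall exit status is 0 and CI treats the step as successful.
sh -c 'false; echo "kept going"'
echo "exit without set -e: $?"

# With `set -e`, the first failure aborts immediately with a non-zero status
# that CI can observe.
sh -c 'set -e; false; echo "never reached"'
echo "exit with set -e: $?"
```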

@jinchengchenghh
Contributor

Thanks for your explanation, I will try the standard GitHub Actions container field with apache/gluten:centos-9-jdk8-cudf, @PHILO-HE

@bdice
Author

bdice commented Jan 22, 2026

Thanks @jinchengchenghh for #11386. I'll close this, I think you got most of the important parts there.

@bdice bdice closed this Jan 22, 2026