build(cudf): Simplify cuDF build configuration #11407

Closed
bdice wants to merge 5 commits into apache:main from bdice:fix-cudf
Conversation

@bdice bdice commented Jan 13, 2026

What changes are proposed in this pull request?

This is a follow-up to my comments on #11275.

This changeset should make it simpler to build with cuDF support.

  • Use CCCL from cudf
    • We should use cuDF's CCCL (fetched from GitHub) instead of finding it from the CUDA Toolkit. This means we can build with any supported CUDA Toolkit version. This is a hard requirement for compatibility with some RAPIDS versions like the upcoming 26.02 release, which uses a CCCL version that isn't shipped in a CUDA Toolkit yet.
  • Fix CMakeLists to use find_package(cudf)
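To illustrate the second bullet, a minimal sketch of what a consumer's CMake configuration might look like once find_package(cudf) works. This is not the PR's actual CMakeLists; the project and source names are hypothetical, and it assumes cuDF is installed (or cudf_DIR points at a build tree):

```cmake
# Hedged sketch: locate cuDF via its package config and link its imported
# target. cuDF's config brings along its own dependencies (rmm, CCCL), so
# nothing needs to be found from the CUDA Toolkit's shipped CCCL.
cmake_minimum_required(VERSION 3.26)
project(gluten_cudf_example LANGUAGES CXX CUDA)

# Honors -Dcudf_DIR=/path/to/cudf/lib/cmake/cudf if cuDF is not installed
# in a default prefix.
find_package(cudf REQUIRED)

# example_main.cpp is a placeholder source name, not from the PR.
add_executable(example_main example_main.cpp)
target_link_libraries(example_main PRIVATE cudf::cudf)
```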

How was this patch tested?

I built this locally in a container.

Author

@bdice bdice left a comment
@zhouyuan Could you trigger CI on this PR? I did some local testing but I'd like to confirm that it's working in CI.

ls -l /usr/local/
source /opt/rh/gcc-toolset-12/enable

source /opt/rh/gcc-toolset-14/enable
Author

@bdice bdice Jan 13, 2026

We do need GCC 14, but we could remove the extra steps above from #11275 that change the CUDA version if you wish. This PR should make it work with the CUDA 12 version that already exists in the container. I know there were quite a few workarounds to reduce the disk space to make room for CUDA 13.1 -- we could revert that too.

Author

If you'd like me to help revert those changes and minimize the build scripts, I can do that. Let me know your thoughts.

Contributor

This line enables GCC 14, though I don't know why we cannot source /opt/rh/gcc-toolset-14/enable directly. Could you check whether building from this Dockerfile works? https://github.com/apache/incubator-gluten/blob/main/dev/docker/cudf/Dockerfile. I hit a curl version issue with it before.

Contributor

It's OK for me to use CUDA 13.1. I have resolved all the version mismatch issues, but I hit a new issue with the newest Velox; I will try to fix it:

26/01/13 10:19:08 ERROR Executor: Exception in task 7.0 in stage 40.0 (TID 12116)
org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (1 vs. 0) Leaf child memory pool cudf-expr-precompile already exists in __sys_root__
Retriable: False
Expression: children_.count(name) == 0
Function: addLeafChild
File: /opt/gluten/ep/build-velox/build/velox_ep/velox/common/memory/MemoryPool.cpp
Line: 331
Stack trace:

@bdice bdice marked this pull request as ready for review January 13, 2026 15:43
@jinchengchenghh
Contributor

Some of these changes duplicate #11386; it's OK with me to merge either of them.

@@ -31,7 +31,7 @@ WORKDIR /opt/gluten
RUN rm -rf /opt/rh/gcc-toolset-12 && ln -s /opt/rh/gcc-toolset-14 /opt/rh/gcc-toolset-12; \
Contributor

Did you try to create a Docker image from this Dockerfile? I hit a curl version issue before; please help verify whether this PR resolves it.

Contributor

@jinchengchenghh jinchengchenghh Jan 19, 2026

It cannot run successfully:

692.8 -- [CURL] Enabled SSL backends: OpenSSL
692.8 -- Setting DuckDB source to AUTO
692.8 -- [DuckDB] Using SYSTEM DuckDB
692.8 -- Using ccache: /usr/bin/ccache
692.8 -- The CUDA compiler identification is unknown
692.8 -- Configuring incomplete, errors occurred!
692.8 make[1]: Leaving directory '/opt/gluten/ep/build-velox/build/velox_ep'
Dockerfile:31
--------------------
  30 |     WORKDIR /opt/gluten
  31 | >>> RUN rm -rf /opt/rh/gcc-toolset-12 && ln -s /opt/rh/gcc-toolset-14 /opt/rh/gcc-toolset-12; \
  32 | >>>     dnf remove -y cuda-toolkit-12* && dnf install -y cuda-toolkit-13-1; \
  33 | >>>     dnf autoremove -y && dnf clean all; \
  34 | >>>     source /opt/rh/gcc-toolset-14/enable; \
  35 | >>>     bash ./dev/buildbundle-veloxbe.sh --run_setup_script=OFF --build_arrow=ON --spark_version=3.4 --build_tests=ON --build_benchmarks=ON --enable_gpu=ON && rm -rf /opt/gluten
  36 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c rm -rf /opt/rh/gcc-toolset-12 && ln -s /opt/rh/gcc-toolset-14 /opt/rh/gcc-toolset-12;     dnf remove -y cuda-toolkit-12* && dnf install -y cuda-toolkit-13-1;     dnf autoremove -y && dnf clean all;     source /opt/rh/gcc-toolset-14/enable;     bash ./dev/buildbundle-veloxbe.sh --run_setup_script=OFF --build_arrow=ON --spark_version=3.4 --build_tests=ON --build_benchmarks=ON --enable_gpu=ON && rm -rf /opt/gluten" did not complete successfully: exit code: 2

@jinchengchenghh
Contributor

This pipeline failed even though the final result reported success: https://github.com/apache/incubator-gluten/actions/runs/20962505327/job/60252099418?pr=11407

In file included from /work/dev/../ep/build-velox/build/velox_ep/_build/release/_deps/cudf-src/cpp/include/cudf/utilities/default_stream.hpp:10,
                 from /work/dev/../ep/build-velox/build/velox_ep/_build/release/_deps/cudf-src/cpp/include/cudf/column/column_view.hpp:8,
                 from /work/dev/../ep/build-velox/build/velox_ep/_build/release/_deps/cudf-src/cpp/include/cudf/column/column.hpp:7,
                 from /work/dev/../ep/build-velox/build/velox_ep/_build/release/_deps/cudf-src/cpp/include/cudf/table/table.hpp:7,
                 from /work/dev/../ep/build-velox/build/velox_ep/velox/experimental/cudf/exec/VeloxCudfInterop.h:22,
                 from /work/cpp/velox/tests/VeloxGpuShuffleWriterTest.cc:32:
/work/dev/../ep/build-velox/build/velox_ep/_build/release/_deps/rmm-src/cpp/include/rmm/cuda_stream_view.hpp:10:10: fatal error: cuda/stream_ref: No such file or directory
   10 | #include <cuda/stream_ref>
      |          ^~~~~~~~~~~~~~~~~
compilation terminated.

dnf remove -y cuda-toolkit-12* && dnf install -y cuda-toolkit-13-1; \
dnf autoremove -y && dnf clean all; \
source /opt/rh/gcc-toolset-12/enable; \
source /opt/rh/gcc-toolset-14/enable; \
Contributor

Is it because we should not source GCC 14?

CMake Error at CMakeLists.txt:476 (enable_language):
  The CMAKE_CUDA_COMPILER:

    /usr/local/cuda-12.8/bin/nvcc

  is not a full path to an existing compiler tool.

  Tell CMake where to find the compiler by setting either the environment
  variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full
  path to the compiler, or to the compiler name if it is in the PATH.
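This error comes from hard-coding a versioned nvcc path. As a hedged sketch (not from the PR itself), a build script can instead resolve nvcc from the unversioned /usr/local/cuda symlink, honoring any CUDACXX already set in the environment:

```shell
# Hypothetical helper: resolve the CUDA compiler without pinning a toolkit
# version. CMake reads the CUDACXX environment variable directly, so exporting
# it avoids passing -DCMAKE_CUDA_COMPILER with a hard-coded version.
CUDACXX="${CUDACXX:-/usr/local/cuda/bin/nvcc}"
export CUDACXX
echo "Using CUDA compiler: ${CUDACXX}"
```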

@jinchengchenghh
Contributor

Please help update:

diff --git a/ep/build-velox/src/build-velox.sh b/ep/build-velox/src/build-velox.sh
index 3e0f6be..ac62f8f 100755
--- a/ep/build-velox/src/build-velox.sh
+++ b/ep/build-velox/src/build-velox.sh
@@ -134,8 +134,8 @@ function compile {
   if [ $ENABLE_GPU == "ON" ]; then
     # the cuda default options are for Centos9 image from Meta
     echo "enable GPU support."
-    COMPILE_OPTION="$COMPILE_OPTION -DVELOX_ENABLE_GPU=ON -DVELOX_ENABLE_CUDF=ON -DCMAKE_CUDA_ARCHITECTURES=70 \
-        -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.8/bin/nvcc"
+    COMPILE_OPTION="$COMPILE_OPTION -DVELOX_ENABLE_GPU=ON -DVELOX_ENABLE_CUDF=ON -DCMAKE_CUDA_ARCHITECTURES=75 \
+        -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc"
   fi
   if [ -n "${GLUTEN_VCPKG_ENABLED:-}" ]; then
     COMPILE_OPTION="$COMPILE_OPTION -DVELOX_GFLAGS_TYPE=static"
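A hedged alternative to hard-coding 70 or 75 (hypothetical, not part of the requested change): derive the architecture from the local GPU via nvidia-smi, falling back to 75 on machines without a visible GPU:

```shell
# Query the GPU's compute capability (e.g. "7.5") and strip the dot to get a
# CMAKE_CUDA_ARCHITECTURES value; fall back to 75 when nvidia-smi is absent
# or no GPU is visible (e.g. on CPU-only build machines).
arch="$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader 2>/dev/null | head -n1 | tr -d '.')"
arch="${arch:-75}"
echo "-DCMAKE_CUDA_ARCHITECTURES=${arch}"
```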

@jinchengchenghh
Contributor

Do you know why the CI succeeds but build failed? #11407 (comment) @PHILO-HE

@jinchengchenghh
Contributor

Thanks for your fix. PR #11386 was ahead of yours and has been merged, and its CI passed. If you think using cudf_DIR is more reasonable, please fix the CI and verify that the build with the Dockerfile succeeds.

@PHILO-HE
Member

Do you know why the CI succeeds but build failed? #11407 (comment) @PHILO-HE

@jinchengchenghh, sorry for missing this comment. set -e must be explicitly added when passing commands via docker run bash -c so that the job stops if any command fails, i.e., docker run bash -c "set -e; xxx". Alternatively, I would recommend using the standard GitHub Actions container field with apache/gluten:centos-9-jdk8-cudf set. In an existing job, we use docker run bash -c for CentOS 7 because that image is incompatible with the GHA checkout action. Since it's CentOS 9 here, I assume the standard way should work.
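The behavior @PHILO-HE describes can be demonstrated with a tiny generic-shell sketch (not the actual CI script):

```shell
# Without `set -e`, a failed command inside `sh -c "..."` does not abort the
# script, so the overall exit status is 0 and CI treats the step as successful.
sh -c 'false; echo "kept going"'
echo "exit without set -e: $?"

# With `set -e`, the first failure aborts immediately with a non-zero status
# that CI can observe.
sh -c 'set -e; false; echo "never reached"'
echo "exit with set -e: $?"
```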

@jinchengchenghh
Contributor

Thanks for your explanation, I will try the standard GitHub Actions container field with apache/gluten:centos-9-jdk8-cudf, @PHILO-HE

@bdice
Author

bdice commented Jan 22, 2026

Thanks @jinchengchenghh for #11386. I'll close this, I think you got most of the important parts there.

@bdice bdice closed this Jan 22, 2026