Docker Image OpenADK with CUDA support has an issue with CUDA installation #4765

gitoabdelgawad · 2024-05-22T14:15:23Z

Checklist

I've read the contribution guidelines.
I've searched other issues and no duplicate issues were found.
I'm convinced that this is not my fault but a bug.

Description

Inside the ghcr.io/autowarefoundation/autoware-openadk:latest-devel-cuda container, I'm trying to use the tensorrt_yolox package. The package includes some CUDA kernels, which fail to build, and the build shows the following warning:

--- stderr: tensorrt_yolox
CMake Warning at CMakeLists.txt:19 (message):
CUDA is not found. preprocess acceleration using CUDA will not be
available.

It seems that the CMake variable CMAKE_CUDA_COMPILER is not set.
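One way to confirm this after a build is to read the package's CMake cache. The sketch below assumes the standard CMakeCache.txt entry format (NAME:TYPE=VALUE); the function name cached_cuda_compiler is made up for illustration and is not part of Autoware or CMake.

```bash
# Print the cached value of CMAKE_CUDA_COMPILER from a CMakeCache.txt file.
# Returns 1 and prints nothing when the file is missing, or when the entry
# is absent, empty, or ends in NOTFOUND.
# (Hypothetical helper; requires bash because of `local`.)
cached_cuda_compiler() {
  local cache_file="$1"
  local value
  [ -f "$cache_file" ] || return 1
  # Keep only the text after "CMAKE_CUDA_COMPILER:<TYPE>=" on the first match.
  value=$(sed -n 's/^CMAKE_CUDA_COMPILER:[A-Z]*=//p' "$cache_file" | head -n 1)
  case "$value" in
    ""|*NOTFOUND) return 1 ;;
  esac
  printf '%s\n' "$value"
}
```

With colcon's default build layout the cache would be at build/tensorrt_yolox/CMakeCache.txt (an assumption about the workspace), so `cached_cuda_compiler build/tensorrt_yolox/CMakeCache.txt || echo "no CUDA compiler cached"` would report whether a compiler was detected.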

Then while using tensorrt_yolox for object detection, the system crashes with the following error:

[tensorrt_yolox_node_exe-2] /home/os/elm/autoware/install/tensorrt_yolox/lib/tensorrt_yolox/tensorrt_yolox_node_exe: symbol lookup error: /home/os/elm/autoware/install/tensorrt_yolox/lib/libtensorrt_yolox.so: undefined symbol: _ZN14tensorrt_yolox50resize_bilinear_letterbox_nhwc_to_nchw32_batch_gpuEPfPhiiiiiiifP11CUstream_st
[ERROR] [tensorrt_yolox_node_exe-2]: process has died [pid 977, exit code 127, cmd '/home/os/elm/autoware/install/tensorrt_yolox/lib/tensorrt_yolox/tensorrt_yolox_node_exe --ros-args -r __node:=tensorrt_yolox --params-file /tmp/launch_params_d1ll7q3z --params-file /tmp/launch_params_cq_ya7ic -r ~/in/image:=/fr_camera/image_rect -r ~/out/objects:=roi0'].

The missing symbol is actually a CUDA kernel that failed to build previously.
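The dangling reference can also be checked directly on the built library. This is a minimal sketch, assuming GNU binutils' nm is available in the container; lists_symbol is a hypothetical helper that only scans nm output for an exact symbol name.

```bash
# Succeed when the given symbol is the last field of any line read from stdin.
# Intended input: the output of `nm -D --undefined-only <library>`.
# (Hypothetical helper; requires bash because of `local`.)
lists_symbol() {
  local symbol="$1"
  awk -v sym="$symbol" '$NF == sym { found = 1 } END { exit !found }'
}
```

For example, `nm -D --undefined-only /home/os/elm/autoware/install/tensorrt_yolox/lib/libtensorrt_yolox.so | lists_symbol "$sym"`, with sym set to the mangled name from the error above, should succeed if the library references the function. An undefined entry alone is normal for shared libraries; it becomes a crash only when no loaded library defines the symbol, which is what the loader error reports.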

Expected behavior

The OpenADK Docker image should have CUDA support and be able to build tensorrt_yolox properly, which would eliminate the missing-symbol runtime error.

Actual behavior

tensorrt_yolox builds with a warning and skips building the CUDA kernels, which leads to a runtime crash later.

Steps to reproduce

Inside the ghcr.io/autowarefoundation/autoware-openadk:latest-devel-cuda container:

source autoware/install/setup.bash

colcon build --symlink-install --cmake-args -DCMAKE_BUILD_TYPE=Release --packages-select tensorrt_yolox
You should notice the CMake warning mentioned above.

ros2 launch tensorrt_yolox yolox_s_plus_opt.launch.xml input/image:=/img output/objects:=/roi0
This is an example of launching an object detection model. Once you subscribe to the output topic (ros2 topic echo /roi0), you should get the runtime error mentioned above.

Versions

No response

Possible causes

After some investigation, including building the official CUDA Samples to track down the issue, it appeared that some CUDA libraries were missing:
/usr/bin/ld: cannot find -lcudadevrt
/usr/bin/ld: cannot find -lcudart_static
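A quick way to see whether these static libraries are present in an image is to search for them by file name. The sketch below assumes the linker flags above correspond to libcudart_static.a and libcudadevrt.a; missing_cuda_static_libs is a hypothetical helper name, and the search root is whatever directory holds the CUDA toolkit in the image.

```bash
# Print the name of each expected static CUDA library that cannot be found
# anywhere below the given root directory. Prints nothing if both exist.
# (Hypothetical helper; requires bash because of `local`.)
missing_cuda_static_libs() {
  local root="$1"
  local lib
  for lib in libcudart_static.a libcudadevrt.a; do
    # -H follows the root itself if it is a symlink (e.g. a versioned toolkit dir).
    if [ -z "$(find -H "$root" -name "$lib" 2>/dev/null | head -n 1)" ]; then
      printf '%s\n' "$lib"
    fi
  done
}
```

Running `missing_cuda_static_libs /usr/local/cuda` (assuming the toolkit is installed there) would list the archives that are absent from the image.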

After applying the following patch and rebuilding the Docker image, the CUDA kernels were built and the object detection model ran correctly.

From 52d5e470d616118d0089e1ff25e5c8016a95450b Mon Sep 17 00:00:00 2001
From: Osama Abdelgawad <oaohaeg@gmail.com>
Date: Wed, 22 May 2024 16:01:59 +0200
Subject: [PATCH] docker change

---
 docker/autoware-openadk/Dockerfile | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/docker/autoware-openadk/Dockerfile b/docker/autoware-openadk/Dockerfile
index 23d260f0..320262ff 100644
--- a/docker/autoware-openadk/Dockerfile
+++ b/docker/autoware-openadk/Dockerfile
@@ -88,9 +88,7 @@ ENV CXX="/usr/lib/ccache/g++"
 RUN --mount=type=ssh \
   ./setup-dev-env.sh -y --module all ${SETUP_ARGS} --no-cuda-drivers openadk \
   && pip uninstall -y ansible ansible-core \
-  && apt-get autoremove -y && apt-get clean -y && rm -rf /var/lib/apt/lists/* "$HOME"/.cache \
-  && find / -name 'libcu*.a' -delete \
-  && find / -name 'libnv*.a' -delete
+  && apt-get autoremove -y && apt-get clean -y && rm -rf /var/lib/apt/lists/* "$HOME"/.cache
 
 # Install rosdep dependencies
 COPY --from=src-imported /autoware/src /autoware/src
-- 
2.34.1

Additional context

No response


xmfcx · 2024-06-19T07:00:42Z

@gitoabdelgawad @oguzkaganozt does/did this PR fix this issue?

feat(docker): fix CUDA compile on devel image and improve run.sh #4849

oguzkaganozt added a commit that referenced this issue May 23, 2024

fix #4765

c08a723

Signed-off-by: Oguz Ozturk <oguzkaganozt@gmail.com>

idorobotics added the type:bug Software flaws or errors. label May 23, 2024

mitsudome-r mentioned this issue May 24, 2024

feat(docker): provide modular images for openadkit planning simulator visualizer #4673

Open

7 tasks

oguzkaganozt added a commit that referenced this issue May 27, 2024

fix #4765

ec2b35d

Signed-off-by: Oguz Ozturk <oguzkaganozt@gmail.com>

oguzkaganozt added a commit that referenced this issue May 27, 2024

fix #4765

7699bab

Signed-off-by: Oguz Ozturk <oguzkaganozt@gmail.com>

oguzkaganozt added a commit that referenced this issue May 28, 2024

fix #4765

6608bfe

Signed-off-by: Oguz Ozturk <oguzkaganozt@gmail.com>

oguzkaganozt mentioned this issue Jun 12, 2024

feat(docker): fix CUDA compile on devel image and improve run.sh #4849

Merged

4 tasks

xmfcx assigned oguzkaganozt Jun 19, 2024
