
Build error on POP! OS 21.10 #49

Open
Jay-Jay-D opened this issue Apr 12, 2022 · 12 comments

Comments

@Jay-Jay-D

Jay-Jay-D commented Apr 12, 2022

I installed all the required dependencies and followed the instructions in the README without any problems, but the build failed at the last step, cmake --build . --config Release -j8.

The error message is

[  9%] Built target alien_base_lib
[  9%] Built target im_file_dialog
[ 10%] Building CUDA object source/EngineGpuKernels/CMakeFiles/alien_engine_gpu_kernels_lib.dir/CudaSimulationFacade.cu.o
[ 18%] Built target alien_engine_interface_lib
/home/jjd/REPOS/alien/source/EngineGpuKernels/CudaSimulationFacade.cu(46): error: variable "instance" was declared but never referenced

1 error detected in the compilation of "/home/jjd/REPOS/alien/source/EngineGpuKernels/CudaSimulationFacade.cu".
gmake[2]: *** [source/EngineGpuKernels/CMakeFiles/alien_engine_gpu_kernels_lib.dir/build.make:95: source/EngineGpuKernels/CMakeFiles/alien_engine_gpu_kernels_lib.dir/CudaSimulationFacade.cu.o] Error 1
gmake[1]: *** [CMakeFiles/Makefile2:358: source/EngineGpuKernels/CMakeFiles/alien_engine_gpu_kernels_lib.dir/all] Error 2

POP! OS is based on Ubuntu Impish; the GPU is a GeForce GTX 1060 6GB with the NVIDIA 470 driver.

Please ask for any other information you need to better understand the issue.

Thanks in advance.

@chrxh
Owner

chrxh commented Apr 12, 2022

The nvcc compiler emits a warning where there should be none
(it's the code line static void init() { [[maybe_unused]] static CudaInitializer instance; }).

A quick fix for you could be to disable the option that treats warnings as errors. This can be done by removing the line add_compile_options($<$<COMPILE_LANGUAGE:CUDA>:--Werror=all-warnings>) from CMakeLists.txt.
Does it work?
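For reference, that would mean deleting or commenting out this line in CMakeLists.txt (a sketch of the change; newer CUDA toolkits also offer a --diag-suppress option to silence one specific diagnostic instead of dropping the warnings-as-errors flag entirely):

```cmake
# Promotes all CUDA warnings to errors -- remove or comment out as a workaround
# for the spurious "declared but never referenced" warning:
# add_compile_options($<$<COMPILE_LANGUAGE:CUDA>:--Werror=all-warnings>)
```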

But I'd like to understand the reason... Which version of the CUDA Toolkit are you using?

@Jay-Jay-D
Author

Jay-Jay-D commented Apr 12, 2022

I tried the quick fix you suggested, and the build continued up to 98% and then failed. Here is the build log.

Below is the output of nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 40%   37C    P8    12W / 120W |    670MiB /  6075MiB |     14%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1774      G   /usr/lib/xorg/Xorg                416MiB |
|    0   N/A  N/A      2170      G   /usr/bin/gnome-shell               64MiB |
|    0   N/A  N/A     10508      G   /usr/lib/firefox/firefox          163MiB |
|    0   N/A  N/A     10837      G   gnome-control-center                1MiB |
|    0   N/A  N/A     14897      G   ...414957901845057616,131072       19MiB |
+-----------------------------------------------------------------------------+

Thank you very much for your quick response!

@chrxh
Owner

chrxh commented Apr 12, 2022

I'm not an expert on such linker problems, but it seems to be the same as in https://githubhot.com/repo/owl-project/owl/issues/144 (last post). A workaround is given there. I've modified the build scripts on the test/nvlink-error branch accordingly. Could you please try it?

@Jay-Jay-D
Author

Thank you very much for the help! Sadly, it didn't work; the output is the same as with the quick fix.

Disclaimer: this topic is WAY above my head, so, sorry if I'm playing Captain Obvious.

I started diving into the issue from the error nvlink fatal : Could not open input file '/usr/lib/x86_64-linux-gnu/librt.a'. The file exists, but its content is a single line:

jjd@pop-os:~$ ll /usr/lib/x86_64-linux-gnu/librt.a
-rw-r--r-- 1 root root 8 Feb 24 16:45 /usr/lib/x86_64-linux-gnu/librt.a
jjd@pop-os:~$ cat /usr/lib/x86_64-linux-gnu/librt.a
!<arch>
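That single line is the 8-byte header of an empty ar archive, i.e. a static library with no members. A quick way to check for this kind of stub (a sketch; it writes a temporary file instead of touching the real /usr/lib path):

```shell
# Reproduce the observed librt.a content in a temp file and test whether it is
# just the bare "!<arch>\n" archive header (8 bytes, no members).
stub=$(mktemp)
printf '!<arch>\n' > "$stub"
if [ "$(wc -c < "$stub")" -eq 8 ]; then
    echo "empty archive stub (no members)"
fi
rm -f "$stub"
```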

There was a glibc update in August last year, and this answer in the Arch subreddit points to that update as a possible source of problems with the NVIDIA linker.

A little further on, the error message points to CMakeFiles/alien.dir/build.make:883; the content of lines 881 to 883 is:

CMakeFiles/alien.dir/cmake_device_link.o: CMakeFiles/alien.dir/dlink.txt
	@$(CMAKE_COMMAND) -E cmake_echo_color --switch=$(COLOR) --green --bold --progress-dir=/home/jjd/REPOS/alien/build/CMakeFiles --progress-num=$(CMAKE_PROGRESS_52) "Linking CUDA device code CMakeFiles/alien.dir/cmake_device_link.o"
	$(CMAKE_COMMAND) -E cmake_link_script CMakeFiles/alien.dir/dlink.txt --verbose=$(VERBOSE)

Finally, in CMakeFiles/alien.dir/dlink.txt we find the usage of librt.a (column 3556), followed by -ldl, -lrt, and -lpthread at column 3860.

So, if those files are empty and cause the error, maybe the solution is to somehow fix those links when generating the dlink.txt file?

Or perhaps it is related to this issue?

Hope this helps to further understand the issue.

@chrxh
Owner

chrxh commented Apr 14, 2022

I don't think that I can really help here :-( The linker error on your system seems to occur in general when compiling a CUDA project via a CMake script...
One could probably also ask here: https://forums.developer.nvidia.com/c/accelerated-computing/cuda/cuda-nvcc-compiler/454
I would be interested in the answer. However, I cannot reproduce the problem on my system.

@martinxyz

I ran into the same build error on Debian testing/sid while compiling v3.2.1 (the "unused" warning).

After I deleted the unused variable in CudaSimulationFacade.cu, everything compiled and seems to run just fine. (Btw., really cool project, thanks a lot.)

System information, in case it helps someone:

$ dpkg -l|grep -i nvidia-cuda
ii  nvidia-cuda-dev:amd64   11.4.3-4          amd64 NVIDIA CUDA development files
ii  nvidia-cuda-gdb         11.4.120~11.4.3-4 amd64 NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit     11.4.3-4          amd64 NVIDIA CUDA development toolkit
ii  nvidia-cuda-toolkit-doc 11.4.3-4          all   NVIDIA CUDA and OpenCL documentation
$ nvidia-smi
Mon Jul 25 11:57:37 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:09:00.0  On |                  N/A |
|  0%   46C    P8    11W / 120W |    389MiB /  5941MiB |     10%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

@martinxyz

martinxyz commented Jul 25, 2022

Yes, it seems to be the CUDA version.

I upgraded from 11.4.3 to 11.5.2 (apt install -t experimental nvidia-cuda-dev nvidia-cuda-toolkit), reverted my change, created a new build folder, and the problem disappeared.

System info after update:

ii  nvidia-cuda-dev:amd64    11.5.2-1           amd64 NVIDIA CUDA development files
ii  nvidia-cuda-gdb          11.5.114~11.5.2-1  amd64 NVIDIA CUDA Debugger (GDB)
ii  nvidia-cuda-toolkit      11.5.2-1           amd64 NVIDIA CUDA development toolkit
ii  nvidia-cuda-toolkit-doc  11.5.2-1           all   NVIDIA CUDA and OpenCL documentation
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.08    Driver Version: 510.73.08    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+

@chrxh
Owner

chrxh commented Jul 25, 2022

Thanks, good to know!
Besides the issue with the CUDA toolkit version, is everything working? (i.e., does the program start normally and can simulations be opened?)
I ask because there were some interoperability issues in the past (e.g., simulations saved on a Windows machine could originally not be loaded on Linux).

@martinxyz

Yes, I ran most of examples/simulations/. The Dark Forest demo runs at ~60 steps per second; the others run faster. I've also tried a few recent ones from the network browser, all good.

It was already working with the old CUDA version (with the workaround). I did not run into the linker issue described by Jay-Jay-D, only the initial build failure (the warning).

@chrxh
Owner

chrxh commented Jul 26, 2022

60 tps seems a bit slow. Which graphics card do you use? How fast is it without rendering (can be toggled with ALT+I)?

@martinxyz

GeForce GTX 1660 Ti; without rendering I get ~100 (±5) time steps per second. I have also tested on Windows now (dual boot): ~70 with UI and ~100 (±5) without. Could be Gnome doing some extra UI work; IIRC that's a thing.

@chrxh
Owner

chrxh commented Jul 26, 2022

Ok, I see. Rendering seems to consume a lot of time here. You can e.g. reduce the frame rate or resolution in the display settings if you want.

3 participants