This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[Build] Add a reasonable default for CMAKE_CUDA_COMPILER in *nix #17293

Merged
merged 5 commits into apache:master on Feb 25, 2020

Conversation

larroy
Contributor

@larroy larroy commented Jan 13, 2020

Description

After recent changes to our CMake files, CMAKE_CUDA_COMPILER is not picked up automatically, as nvcc is not usually on the PATH. This PR sets a reasonable default that follows the NVIDIA tooling convention of symlinking /usr/local/cuda to the default CUDA version.

@larroy larroy requested a review from szha as a code owner January 13, 2020 23:47
@larroy
Contributor Author

larroy commented Jan 13, 2020

@mxnet-label-bot add [pr-awaiting-review]

@lanking520 lanking520 added the pr-awaiting-review PR is waiting for code review label Jan 13, 2020
@larroy
Contributor Author

larroy commented Jan 13, 2020

@mxnet-label-bot add [Build]

@larroy
Contributor Author

larroy commented Jan 13, 2020

@leezu

@leezu
Contributor

leezu commented Jan 14, 2020

Why would nvcc not be on the PATH? Could you provide an example system for reference that comes with this setup?

@larroy
Contributor Author

larroy commented Jan 14, 2020

nvcc is normally not on the PATH with the Ubuntu NVIDIA packages; it is installed at the path you can see in this PR.

@larroy
Contributor Author

larroy commented Jan 14, 2020

/usr/local/cuda/bin/nvcc is usually NOT on the PATH.

@leezu
Contributor

leezu commented Jan 15, 2020

It's expected that to compile software, the compiler must be available. For that, it must either be on $PATH or be specified manually.

For C and C++ compilers, this is what the CC and CXX environment variables are for.
For example, CC=gcc-9 CXX=g++-9 cmake .. will prepare the build with GCC 9.

Likewise, if users want to use a non-standard nvcc (i.e. an nvcc that is not on the PATH), they can set CUDACXX: CUDACXX=/usr/local/cuda/bin/nvcc cmake ..

I think it's better to follow standard practice instead of making additional assumptions. For example, we may want to use clang to compile the CUDA files instead of nvcc. If nvcc is not on the PATH, users may reasonably expect that clang will be used.

What do you think?
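The convention described above can be sketched as follows; the compiler names and paths here are illustrative assumptions, not requirements of the PR:

```shell
# Sketch of the standard compiler-selection convention via environment
# variables. Compiler names/paths below are illustrative assumptions.
export CC=gcc-9                              # C compiler read on first configure
export CXX=g++-9                             # C++ compiler
export CUDACXX=/usr/local/cuda/bin/nvcc      # CUDA compiler when nvcc is not on PATH
echo "CUDACXX=$CUDACXX"
# cmake ..   # CMake would pick up CC/CXX/CUDACXX on the initial configure
```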

CMakeLists.txt Outdated
@@ -84,6 +84,10 @@ message(STATUS "CMake version '${CMAKE_VERSION}' using generator '${CMAKE_GENERA
project(mxnet C CXX)
if(USE_CUDA)
cmake_minimum_required(VERSION 3.13.2) # CUDA 10 (Turing) detection available starting 3.13.2
if (NOT MSVC AND (NOT DEFINED CMAKE_CUDA_COMPILER OR "${CMAKE_CUDA_COMPILER}" STREQUAL "CMAKE_CUDA_COMPILER-NOTFOUND"))
Contributor

@leezu leezu Jan 15, 2020

Would it be required to check CUDACXX?

Contributor Author

It worked fine with this change; could you expand on your concern/question?

Contributor

Your current logic may prevent users from setting the CUDACXX environment variable. That's the standard way of specifying the CUDA compiler, and it's good to follow standards to avoid technical debt.
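A hedged sketch of a guard that honors CUDACXX before guessing a default; this is hypothetical and not the code in this PR:

```cmake
include(CheckLanguage)
if(DEFINED ENV{CUDACXX})
  # Honor the standard CUDACXX environment variable first.
  set(CMAKE_CUDA_COMPILER "$ENV{CUDACXX}")
elseif(NOT MSVC AND EXISTS "/usr/local/cuda/bin/nvcc")
  # Otherwise fall back to the conventional default CUDA location on *nix.
  set(CMAKE_CUDA_COMPILER "/usr/local/cuda/bin/nvcc")
endif()
```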

CMakeLists.txt (resolved)
@apeforest
Contributor

@larroy In which environment did you see this problem? Could you paste the diagnose.py result here?

@larroy
Contributor Author

larroy commented Jan 15, 2020

@apeforest Ubuntu 18.04 with the NVIDIA machine learning APT repositories, pretty standard. I will update with the diagnose.py output as requested, thanks.

@leezu
Contributor

leezu commented Jan 15, 2020

@larroy when adding the APT repository, users should add the respective folder to their PATH. Is that not documented on the NVIDIA page?

@larroy
Contributor Author

larroy commented Jan 16, 2020

We never had to do such a thing; this is happening due to the CMake changes. I applied your suggestion. I would still suggest applying my proposed patch, which makes things smoother for users in 99% of cases.

@larroy
Contributor Author

larroy commented Jan 16, 2020

----------Python Info----------
('Version      :', '2.7.17')
('Compiler     :', 'GCC 7.4.0')
('Build        :', ('default', 'Nov  7 2019 10:07:09'))
('Arch         :', ('64bit', ''))
------------Pip Info-----------
No corresponding pip install for current python.
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
('Platform     :', 'Linux-4.15.0-1054-aws-x86_64-with-Ubuntu-18.04-bionic')
('system       :', 'Linux')
('node         :', '34-222-129-72')
('release      :', '4.15.0-1054-aws')
('version      :', '#56-Ubuntu SMP Thu Nov 7 16:15:59 UTC 2019')
----------Hardware Info----------
('machine      :', 'x86_64')
('processor    :', 'x86_64')
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
On-line CPU(s) list: 0-7
Thread(s) per core:  2
Core(s) per socket:  4
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:            1
CPU MHz:             1455.803
CPU max MHz:         3000.0000
CPU min MHz:         1200.0000
BogoMIPS:            4600.12
Hypervisor vendor:   Xen
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            46080K
NUMA node0 CPU(s):   0-7
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0029 sec, LOAD: 0.4997 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0168 sec, LOAD: 0.3225 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0236 sec, LOAD: 0.1133 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0100 sec, LOAD: 0.0518 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1954 sec, LOAD: 0.2625 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.3784 sec, LOAD: 0.1460 sec.
----------Environment----------

@larroy
Contributor Author

larroy commented Jan 16, 2020

piotr@34-222-129-72:0:~/mxnet (cmake_cuda_compiler)+$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.3 LTS
Release:        18.04
Codename:       bionic

@larroy
Contributor Author

larroy commented Jan 16, 2020

#17031

@larroy
Contributor Author

larroy commented Jan 16, 2020

@mxnet-label-bot add [breaking]

@larroy
Contributor Author

larroy commented Jan 16, 2020

This fixes #15492

Contributor

@samskalicky samskalicky left a comment

LGTM

Contributor

@ChaiBapchya ChaiBapchya left a comment

Awesome!

Contributor

@leezu leezu left a comment

Please clarify if this breaks CUDACXX env variable.

Further, this only helps users who didn't install CUDA correctly. See the mandatory post-installation actions in the CUDA installation guide: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#mandatory-post

I don't think we should include logic in our build system to handle systems in a broken state.

@larroy
Contributor Author

larroy commented Jan 16, 2020

What do you suggest w.r.t. https://cmake.org/cmake/help/v3.13/envvar/CUDACXX.html? Should we check whether it is set? I understand your concern. Please suggest a better approach; I'm not a CMake expert, but this worked before without needing to set any paths, even though you are right about the documentation from NVIDIA.

I can compile PyTorch just fine without any additional changes to PATH or the environment. When there's a single CUDA version installed or symlinked to /usr/local/cuda, we should just pick that one up unless specified otherwise.

Please propose changes or alternatives.

@larroy
Contributor Author

larroy commented Jan 16, 2020

This is the output from the PyTorch build:


-- Found CUDA: /usr/local/cuda (found version "10.2")
-- Caffe2: CUDA detected: 10.2
-- Caffe2: CUDA nvcc is: /usr/local/cuda/bin/nvcc
-- Caffe2: CUDA toolkit directory: /usr/local/cuda
-- Caffe2: Header version is: 10.2
-- Found CUDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
-- Found cuDNN: v7.6.5  (include: /usr/include, library: /usr/lib/x86_64-linux-gnu/libcudnn.so)
-- Autodetected CUDA architecture(s):  7.0
-- Added CUDA NVCC flags for: -gencode;arch=compute_70,code=sm_70
-- Autodetected CUDA architecture(s):  7.0

@leezu
Contributor

leezu commented Jan 16, 2020

@larroy, yes, it's required to check the environment variables (both CUDACXX and PATH) before "falling back" to some default path. But it may be hard to check both environment variables correctly, so let's just rely on CMake to figure out whether it can find nvcc by standard means. Only if that fails, fall back to the default path.

Thus I suggest the following approach instead

  include(CheckLanguage)  # provides check_language()
  check_language(CUDA)
  if (NOT CMAKE_CUDA_COMPILER AND UNIX AND EXISTS "/usr/local/cuda/bin/nvcc")
    set(CMAKE_CUDA_COMPILER "/usr/local/cuda/bin/nvcc")
    message(WARNING "CMAKE_CUDA_COMPILER guessed: ${CMAKE_CUDA_COMPILER}")
  endif()

It should be placed at the same position as the changes done in this PR.

My concern is that if nvcc is not on the PATH, users may also have forgotten to set LD_LIBRARY_PATH. This will lead to issues when attempting to load MXNet later.
Thus I think it's preferable to educate users on how to set up their system correctly, instead of attempting to work around broken systems (as we would never be able to catch all the various ways in which a system may be broken).
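As a sketch, the mandatory post-install steps referenced from the NVIDIA guide amount to roughly the following (the default /usr/local/cuda install location is an assumption):

```shell
# Sketch of NVIDIA's documented post-install environment setup
# (assumes the conventional /usr/local/cuda install location).
export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
# On a machine without CUDA this simply reports that nvcc is absent:
command -v nvcc || echo "nvcc still not found (CUDA not installed here)"
```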

@larroy
Contributor Author

larroy commented Jan 16, 2020

Thanks for the clarifications, makes sense. I don't think LD_LIBRARY_PATH will be an issue in this case, as the mxnet .so points to the right library; I can show that this is the case. I disagree with you regarding "broken system": /usr/local/cuda is the convention for the default CUDA installation, even though it is not really in the NVIDIA documentation. I see your point, but we should just work by default, as before.

https://docs.roguewave.com/en/totalview/2018/html/index.html#page/User_Guides/totalviewug-about-cuda.32.3.html

@leezu
Contributor

leezu commented Jan 16, 2020

LD_LIBRARY_PATH will not cause issues in this case, but it's a related source of problems: it will cause failures if users install CUDA via the runfile and forget to set LD_LIBRARY_PATH.

In any case, it's just an example of why we can't handle all kinds of broken systems.

Given the updated strategy using check_language(CUDA), it should be fine to merge this PR.

@larroy
Contributor Author

larroy commented Jan 28, 2020

@leezu then please approve.

@leezu
Contributor

leezu commented Jan 28, 2020

@larroy why not use the approach in #17293 (comment)?

I don't think the PR handles the described case correctly yet.

@leezu
Contributor

leezu commented Feb 24, 2020

@larroy why close this issue?

You can just copy the suggested code change and push; then it can be merged:

  include(CheckLanguage)  # provides check_language()
  check_language(CUDA)
  if (NOT CMAKE_CUDA_COMPILER AND UNIX AND EXISTS "/usr/local/cuda/bin/nvcc")
    set(CMAKE_CUDA_COMPILER "/usr/local/cuda/bin/nvcc")
    message(WARNING "CMAKE_CUDA_COMPILER guessed: ${CMAKE_CUDA_COMPILER}")
  endif()

@leezu leezu reopened this Feb 24, 2020
@larroy
Contributor Author

larroy commented Feb 24, 2020

I don't have much bandwidth left, but if the change is this small I can finish the PR. It seems the Linux GPU pipeline is timing out often, though.

Contributor

@leezu leezu left a comment

Thanks

@leezu leezu merged commit 8bdf068 into apache:master Feb 25, 2020
leezu added a commit that referenced this pull request Mar 5, 2020
Fixes a bug in #17293 causing an infinite loop on some systems.
MoisesHer pushed a commit to MoisesHer/incubator-mxnet that referenced this pull request Apr 10, 2020
Fixes a bug in apache#17293 causing an infinite loop on some systems.
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
* [Build] Add a reasonable default for CMAKE_CUDA_COMPILER in *nix

* CR

* CR

* Update as per CR comments

* include(CheckLanguage)

Co-authored-by: Leonard Lausen <leonard@lausen.nl>
anirudh2290 pushed a commit to anirudh2290/mxnet that referenced this pull request May 29, 2020
Fixes a bug in apache#17293 causing an infinite loop on some systems.
Labels
Breaking Build pr-awaiting-review PR is waiting for code review
7 participants