__hmin/__hmax already defined on compute_cap 75 with newer driver version #762

Closed
opfromthestart opened this issue May 3, 2023 · 7 comments · Fixed by #788
Labels
bug Something isn't working

Comments

@opfromthestart
Contributor

When I try to compile dfdx with CUDA enabled, I get the following error:

  --- stderr
  thread 'main' panicked at 'nvcc error while compiling "src/optim/adam/adam.cu":

  # stdout


  # stderr
  src/tensor_ops/utilities/compatibility.cuh(9): error: function "__hmax" has already been defined
    __attribute__((device)) __inline__ __attribute__((always_inline)) __half __hmax(__half a, __half b) {
                                                                             ^

  src/tensor_ops/utilities/compatibility.cuh(12): error: function "__hmin" has already been defined
    __attribute__((device)) __inline__ __attribute__((always_inline)) __half __hmin(__half a, __half b) {
                                                                             ^

  2 errors detected in the compilation of "src/optim/adam/adam.cu".

My guess is that it's related to the compatibility fix for compute_cap 75. I think I needed that fix before, but since I updated my drivers, the functions are now already defined.

@opfromthestart
Contributor Author

nvidia-smi --query-gpu compute_cap --format=csv gives compute_cap 7.5
nvcc --version gives

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0

nvcc --list-gpu-code gives

sm_50
sm_52
sm_53
sm_60
sm_61
sm_62
sm_70
sm_72
sm_75
sm_80
sm_86
sm_87
sm_89
sm_90
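For reference, the two outputs above can be combined at build time to choose a target. The following is a minimal Rust sketch, not dfdx's actual build script: `pick_gpu_code` is a hypothetical helper that takes the text printed by `nvcc --list-gpu-code` and a compute capability already converted to an integer (7.5 becomes 75), and returns the highest supported `sm_XX` number that does not exceed it.

```rust
/// Hypothetical helper: choose the highest `sm_XX` entry from
/// `nvcc --list-gpu-code` output that is <= the GPU's compute capability.
/// Lines that are not of the form `sm_<number>` are ignored.
fn pick_gpu_code(list_output: &str, compute_cap: u32) -> Option<u32> {
    list_output
        .lines()
        .filter_map(|line| line.trim().strip_prefix("sm_"))
        .filter_map(|num| num.parse::<u32>().ok())
        .filter(|&num| num <= compute_cap)
        .max()
}
```

With the list shown above and a compute capability of 75, this selects 75, which matches the `sm_75` target reported in this issue.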

@coreylowman
Owner

Can you expand on what you mean by updated your drivers? I guess I had assumed all compute_caps of the same number would have similar issues, but you're still compiling with 75 and getting this error?

@opfromthestart
Contributor Author

My drivers were on version 525, and I had versions 11.6 and 12.1 of all CUDA-related libraries installed. I installed version 530 of the drivers and removed the 11.6 versions of the libraries, and that made the llama-dfdx example work.

@opfromthestart
Contributor Author

When I try to use version 525 of the drivers, I get the following error:

Caused by:
  process didn't exit successfully: `/home/opfromthestart/rust/game/touhou-diff/target/release/build/dfdx-5455800ceba8656f/build-script-build` (exit status: 101)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rustc-cfg=feature="nightly"
  cargo:rustc-env=CUDA_INCLUDE_DIR=/usr/local/cuda/include
  cargo:rerun-if-changed=src/tensor_ops/utilities/binary_op_macros.cuh
  cargo:rerun-if-changed=src/tensor_ops/utilities/compatibility.cuh
  cargo:rerun-if-changed=src/tensor_ops/utilities/cuda_utils.cuh
  cargo:rerun-if-changed=src/tensor_ops/utilities/unary_op_macros.cuh

  --- stderr
  thread 'main' panicked at 'assertion failed: `(left == right)`
    left: `"Failed to initialize NVML: Driver/library version mismatch"`,
   right: `"compute_cap"`', /home/opfromthestart/.cargo/git/checkouts/dfdx-318e6e5ad83eea79/5e2b93d/build.rs:132:17

This was why I upgraded to 530.
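The panic above comes from an assertion that the first line of the `nvidia-smi` output equals `compute_cap`, so an NVML failure surfaces as a confusing left/right mismatch. As a hedged illustration only (this is not the code in dfdx's `build.rs`, and `parse_compute_cap` is a hypothetical name), a build script could parse the output and return a descriptive error instead:

```rust
/// Hypothetical helper: parse the text printed by
/// `nvidia-smi --query-gpu=compute_cap --format=csv`.
/// Returns the capability of the first GPU as an integer ("7.5" -> 75),
/// or an error message that includes whatever nvidia-smi printed instead.
fn parse_compute_cap(output: &str) -> Result<u32, String> {
    let mut lines = output.lines().map(str::trim).filter(|l| !l.is_empty());

    // The first non-empty line is expected to be the CSV header.
    let header = lines
        .next()
        .ok_or_else(|| String::from("nvidia-smi printed nothing"))?;
    if header != "compute_cap" {
        return Err(format!("unexpected nvidia-smi output: {}", header));
    }

    // The second line holds the value for the first GPU, e.g. "7.5".
    let cap = lines
        .next()
        .ok_or_else(|| String::from("nvidia-smi reported no GPUs"))?;
    let digits: String = cap.chars().filter(|c| *c != '.').collect();
    digits
        .parse::<u32>()
        .map_err(|e| format!("could not parse compute_cap {:?}: {}", cap, e))
}
```

With this shape, the "Driver/library version mismatch" text ends up in the error message, which points more directly at the driver problem.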

@coreylowman
Owner

Ahh okay, so are you still having the original error then about hmin/hmax?

Maybe we should be hooking into driver versions instead of GPU_ARCH for the ifdefs? I wonder if that's available...

@coreylowman
Owner

Hmm it seems like getting driver version is limited to runtime. 🤔
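One thing that is visible at build time is the toolkit version: nvcc predefines `__CUDACC_VER_MAJOR__` and `__CUDACC_VER_MINOR__`, and the same release number is printed by `nvcc --version`. Whether the toolkit release is the right thing to key the `__hmin`/`__hmax` definitions on is an assumption here, suggested only by the report above using release 12.1. A minimal Rust sketch for a build script that extracts the release (`parse_nvcc_release` is a hypothetical helper, not part of dfdx):

```rust
/// Hypothetical helper: extract (major, minor) from `nvcc --version` output,
/// which contains a line such as
/// "Cuda compilation tools, release 12.1, V12.1.105".
fn parse_nvcc_release(output: &str) -> Option<(u32, u32)> {
    // Find the line that carries the release number.
    let line = output.lines().find(|l| l.contains("release "))?;
    // Take the text after "release " up to the next comma, e.g. "12.1".
    let after = line.split("release ").nth(1)?;
    let version = after.split(',').next()?.trim();

    let mut parts = version.split('.');
    let major = parts.next()?.parse::<u32>().ok()?;
    let minor = parts.next()?.parse::<u32>().ok()?;
    Some((major, minor))
}
```

A build script could use the result to log the detected toolkit, or the header could test the predefined macros directly without any help from the build script.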

@coreylowman coreylowman changed the title Error compiling CUDA kernels __hmin/__hmax already defined on compute_cap 75 with newer driver version May 5, 2023
@coreylowman coreylowman added the bug Something isn't working label May 5, 2023
@9876691

9876691 commented May 13, 2023

I also get this error. I set up a VS Code dev container with the following .devcontainer/devcontainer.json:

{
	"name": "Rust",
	"image": "nvidia/cuda:12.1.1-devel-ubuntu20.04", 
	
	"runArgs": [
		"--gpus",
		"all"
	],
	"features": {
		"ghcr.io/devcontainers/features/rust:1": {}
	}
}

Running nvidia-smi in the container gives:

root@e5d2279e80a7:/workspaces/dfdx# nvidia-smi
Sat May 13 11:23:01 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:1D:00.0  On |                  N/A |
|  0%   39C    P0    N/A /  90W |   1273MiB /  4096MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Running cargo test --features "cuda" gives:

root@e5d2279e80a7:/workspaces/dfdx# cargo test --features "cuda"
   Compiling dfdx v0.11.2 (/workspaces/dfdx)
error: failed to run custom build command for `dfdx v0.11.2 (/workspaces/dfdx)`

Caused by:
  process didn't exit successfully: `/workspaces/dfdx/target/debug/build/dfdx-30e6be024c8b3335/build-script-build` (exit status: 101)
  --- stdout
  cargo:rerun-if-changed=build.rs
  cargo:rustc-env=CUDA_INCLUDE_DIR=/usr/local/cuda/include
  cargo:rerun-if-changed=src/tensor_ops/utilities/binary_op_macros.cuh
  cargo:rerun-if-changed=src/tensor_ops/utilities/compatibility.cuh
  cargo:rerun-if-changed=src/tensor_ops/utilities/cuda_utils.cuh
  cargo:rerun-if-changed=src/tensor_ops/utilities/unary_op_macros.cuh
  cargo:rerun-if-env-changed=CUDA_COMPUTE_CAP
  cargo:rustc-env=CUDA_COMPUTE_CAP=sm_61
  cargo:rerun-if-changed=src/optim/adam/adam.cu
  cargo:rerun-if-changed=src/optim/rmsprop/rmsprop.cu
  cargo:rerun-if-changed=src/optim/sgd/sgd.cu
  cargo:rerun-if-changed=src/tensor_ops/abs/abs.cu
  cargo:rerun-if-changed=src/tensor_ops/add/binary_add.cu
  cargo:rerun-if-changed=src/tensor_ops/add/scalar_add.cu
  cargo:rerun-if-changed=src/tensor_ops/attention_reshape/attention_reshape.cu
  cargo:rerun-if-changed=src/tensor_ops/axpy/axpy.cu
  cargo:rerun-if-changed=src/tensor_ops/bce/bce.cu
  cargo:rerun-if-changed=src/tensor_ops/boolean/boolean.cu
  cargo:rerun-if-changed=src/tensor_ops/choose/choose.cu
  cargo:rerun-if-changed=src/tensor_ops/clamp/clamp.cu
  cargo:rerun-if-changed=src/tensor_ops/cmp/cmp.cu
  cargo:rerun-if-changed=src/tensor_ops/conv2d/conv2d.cu
  cargo:rerun-if-changed=src/tensor_ops/convtrans2d/convtrans2d.cu
  cargo:rerun-if-changed=src/tensor_ops/cos/cos.cu
  cargo:rerun-if-changed=src/tensor_ops/div/binary_div.cu
  cargo:rerun-if-changed=src/tensor_ops/div/scalar_div.cu
  cargo:rerun-if-changed=src/tensor_ops/dropout/dropout.cu
  cargo:rerun-if-changed=src/tensor_ops/exp/exp.cu
  cargo:rerun-if-changed=src/tensor_ops/gelu/gelu.cu
  cargo:rerun-if-changed=src/tensor_ops/huber_error/huber_error.cu
  cargo:rerun-if-changed=src/tensor_ops/ln/ln.cu
  cargo:rerun-if-changed=src/tensor_ops/max_to/max_to.cu
  cargo:rerun-if-changed=src/tensor_ops/maximum/maximum.cu
  cargo:rerun-if-changed=src/tensor_ops/min_to/min_to.cu
  cargo:rerun-if-changed=src/tensor_ops/minimum/minimum.cu
  cargo:rerun-if-changed=src/tensor_ops/mul/binary_mul.cu
  cargo:rerun-if-changed=src/tensor_ops/mul/scalar_mul.cu
  cargo:rerun-if-changed=src/tensor_ops/nans_to/nans_to.cu
  cargo:rerun-if-changed=src/tensor_ops/negate/negate.cu
  cargo:rerun-if-changed=src/tensor_ops/pool2d/pool2d.cu
  cargo:rerun-if-changed=src/tensor_ops/pow/pow.cu
  cargo:rerun-if-changed=src/tensor_ops/recip/recip.cu
  cargo:rerun-if-changed=src/tensor_ops/relu/relu.cu
  cargo:rerun-if-changed=src/tensor_ops/roll/roll.cu
  cargo:rerun-if-changed=src/tensor_ops/select_and_gather/gather.cu
  cargo:rerun-if-changed=src/tensor_ops/select_and_gather/select.cu
  cargo:rerun-if-changed=src/tensor_ops/sigmoid/sigmoid.cu
  cargo:rerun-if-changed=src/tensor_ops/sin/sin.cu
  cargo:rerun-if-changed=src/tensor_ops/slice/slice.cu
  cargo:rerun-if-changed=src/tensor_ops/sqrt/sqrt.cu
  cargo:rerun-if-changed=src/tensor_ops/square/square.cu
  cargo:rerun-if-changed=src/tensor_ops/sub/binary_sub.cu
  cargo:rerun-if-changed=src/tensor_ops/sub/scalar_sub.cu
  cargo:rerun-if-changed=src/tensor_ops/sum_to/sum_to.cu
  cargo:rerun-if-changed=src/tensor_ops/tanh/tanh.cu
  cargo:rerun-if-changed=src/tensor_ops/upscale2d/upscale2d.cu

  --- stderr
  thread 'main' panicked at 'nvcc error while compiling "src/optim/adam/adam.cu":

  # stdout


  # stderr
  src/tensor_ops/utilities/compatibility.cuh(9): error: function "__hmax" has already been defined
    __attribute__((device)) __inline__ __attribute__((always_inline)) __half __hmax(__half a, __half b) {
                                                                             ^

  src/tensor_ops/utilities/compatibility.cuh(12): error: function "__hmin" has already been defined
    __attribute__((device)) __inline__ __attribute__((always_inline)) __half __hmin(__half a, __half b) {
                                                                             ^

  2 errors detected in the compilation of "src/optim/adam/adam.cu".
  ', build.rs:197:17
  note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

nvcc --version

root@e5d2279e80a7:/workspaces/dfdx# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
