
__CUDA_ARCH__ macro is unreliable #6529

Closed · Avlyssna opened this issue Apr 7, 2024 · 3 comments

Avlyssna commented Apr 7, 2024

I discovered this issue while trying to use llama.cpp through llama-cpp-python, but the root cause appears to be in llama.cpp itself. During execution I get errors such as:

/llama.cpp/ggml-cuda/convert.cu:64: ERROR: CUDA kernel dequantize_block_q8_0_f16 has no device code compatible with CUDA arch 520. ggml-cuda.cu was compiled for: 520.

This was previously reported, but that issue was closed shortly afterwards with no clarity on the fix. The error primarily occurs when using functionary v2 for tool selection; normal chatting works fine. I am using 2 x 3090 graphics cards (not linked with NVLink) with driver version 550.54.15 and CUDA version 12.4 (update 1) on Debian x86_64.

Despite being on a relatively new driver with the latest CUDA version, the __CUDA_ARCH__ macro reports 520, which causes functionality designed for Pascal and higher (CC_PASCAL = 600) to fail. If I'm not mistaken, this value should be 860 for 3090-series cards, but it clearly isn't.

I used this code to confirm the __CUDA_ARCH__ macro value:

// Build e.g. with: nvcc -arch=native probe.cu -o probe
#include <cstdio>

// Stringify __CUDA_ARCH__ so the compile-time value can be printed at runtime.
#define STR_HELPER(x) #x
#define STR(x) STR_HELPER(x)

__device__ void print_arch() {
    // __CUDA_ARCH__ is only defined during device compilation, so this reports
    // the architecture the kernel was compiled for, not the GPU it runs on.
    const char my_compile_time_arch[] = STR(__CUDA_ARCH__);
    printf("__CUDA_ARCH__: %s\n", my_compile_time_arch);
}

__global__ void example() {
    print_arch();
}

int main() {
    example<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
Avlyssna (Author) commented Apr 7, 2024

As an additional note, I attempted to compile llama.cpp's latest commit (855f544) with the CC_PASCAL, MIN_CC_DP4A, and CC_VOLTA values set to 0. Once I had a fresh libllama.so, I overwrote the old one and it magically started working. I'm not sure of the best way to fix this check in llama.cpp, but it really looks like __CUDA_ARCH__ can't be trusted for feature availability. 😢
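For reference, the pattern at play looks roughly like the sketch below (a simplified illustration, not llama.cpp's actual code; CC_PASCAL here just stands in for the 600 constant mentioned above). Because __CUDA_ARCH__ is fixed at compile time, building only for the sm_52 default compiles the Pascal-and-newer path away entirely, which is why zeroing the constants "fixes" it:

#include <cstdio>

// Stand-in for the compute-capability threshold discussed above (6.0).
#define CC_PASCAL 600

__global__ void guarded_kernel(float *out) {
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= CC_PASCAL
    // Device code for this path exists only when compiled for sm_60 or newer.
    out[0] = 1.0f;
#else
    // When built only for sm_52, this branch wins and the feature is absent.
    out[0] = -1.0f;
#endif
}

int main() {
    float *out;
    cudaMallocManaged(&out, sizeof(float));
    *out = 0.0f;
    guarded_kernel<<<1, 1>>>(out);
    cudaDeviceSynchronize();
    // Prints "yes" when built with e.g. -arch=sm_86, "no" with the sm_52 default.
    printf("guarded path taken: %s\n", *out > 0.0f ? "yes" : "no");
    cudaFree(out);
}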

JohannesGaessler (Collaborator) commented
The problem is that you are compiling the llama.cpp code for compute capability 5.2, which is the default for CUDA 12, but the code needs compute capability 6.1 or higher. In llama.cpp proper, the code is compiled either for the compute capability of the GPU in the system (make) or for compute capabilities 5.2, 6.1, and 7.0 (cmake). If it is compiled only for 5.2, the CPU code will at runtime select a kernel that has no device code. The fix is to modify whichever command you're using for compilation so that it sets the correct CUDA architecture, e.g. via -arch=native.
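One way to confirm which architecture the build needs to target is to query the GPUs at runtime. The probe below is a minimal sketch using the standard CUDA runtime API (the 8.6 / 6.1 figures are the ones already mentioned in this thread, not output of this program):

#include <cstdio>

// Prints the compute capability of each visible GPU so it can be compared
// against the architectures the binary was compiled for (e.g. 8.6 for a 3090,
// with 6.1 as the minimum mentioned above).
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}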

Avlyssna (Author) commented Apr 8, 2024

@JohannesGaessler - That was the problem! There was an issue with my environment variables in the build pipeline, so the project was failing to pass the correct make flags from the start. Thanks for the help; closing this issue now.
