Installing Llama.cpp without nvcc from nvidia-cuda-toolkit #3459

Closed
jamesbraza opened this issue Oct 3, 2023 · 10 comments · Fixed by #3510

jamesbraza commented Oct 3, 2023

On an AWS EC2 g4dn.4xlarge (Ubuntu 22.04.2, x86_64, cuda apt package installed for cuBLAS support, NVIDIA Tesla T4), I am trying to install Llama.cpp with cuBLAS acceleration.

I am getting an error during pip install (see the dropdown below). abetlen/llama-cpp-python#205 hit this same issue, and the solution there was apt install nvidia-cuda-toolkit to get nvcc.

However, per https://forums.developer.nvidia.com/t/impossible-resolution-ubuntu-22-04/, apt currently cannot resolve cuda (for cuBLAS) and nvidia-cuda-toolkit (for nvcc) together.

Is there something I can do to install Llama.cpp with cuBLAS support, but without having nvcc?

pip install fail
CMAKE_ARGS="-DLLAMA_CUBLAS=on" python -m pip install "llama-cpp-python[server]==0.2.11" \
    --progress-bar off --no-cache-dir
...
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [36 lines of output]
      *** scikit-build-core 0.5.1 using CMake 3.27.6 (wheel)
      *** Configuring CMake...
      loading initial cache file /tmp/tmpsezaa2fg/build/CMakeInit.txt
      -- The C compiler identification is GNU 11.4.0
      -- The CXX compiler identification is GNU 11.4.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Found Git: /usr/bin/git (found version "2.34.1")
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      -- Found CUDAToolkit: /usr/local/cuda/include (found version "12.2.140")
      -- cuBLAS found
      -- The CUDA compiler identification is unknown
      CMake Error at /tmp/pip-build-env-xoh3dnhj/normal/lib/python3.11/site-packages/cmake/data/share/cmake-3.27/Modules/CMakeDetermineCUDACompiler.cmake:603 (message):
        Failed to detect a default CUDA architecture.



        Compiler output:

      Call Stack (most recent call first):
        vendor/llama.cpp/CMakeLists.txt:290 (enable_language)


      -- Configuring incomplete, errors occurred!

      *** CMake configuration failed
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

Environment and Context

ubuntu@ip-123:~$ lscpu
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
    CPU family:          6
    Model:               85
    Thread(s) per core:  2
    Core(s) per socket:  8
    Socket(s):           1
    Stepping:            7
    BogoMIPS:            4999.99
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   256 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    8 MiB (8 instances)
  L3:                    35.8 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:
  Gather data sampling:  Unknown: Dependent on hypervisor status
  Itlb multihit:         KVM: Mitigation: VMX unsupported
  L1tf:                  Mitigation; PTE Inversion
  Mds:                   Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Meltdown:              Mitigation; PTI
  Mmio stale data:       Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
  Retbleed:              Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

ubuntu@ip-123:~$ uname -a
Linux ip-123 6.2.0-1012-aws #12~22.04.1-Ubuntu SMP Thu Sep  7 14:01:24 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@ip-123:~$ python3 --version
Python 3.10.12

ubuntu@ip-123:~$ make --version
GNU Make 4.3
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

ubuntu@ip-123:~$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ubuntu@ip-123:~$ nvidia-smi
Tue Oct  3 21:21:11 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
jamesbraza (Author)

Cc @abetlen and @gjmulder, since they were responding a lot in the other issue.

jamesbraza (Author) commented Oct 3, 2023

With a local git clone of Llama.cpp at b1317, make with cuBLAS acceleration errors out because nvcc is not present:

> make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native
I NVCCFLAGS: --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native "
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native    -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c llama.cpp -o llama.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c common/common.cpp -o common.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c common/console.cpp -o console.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c common/grammar-parser.cpp -o grammar-parser.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native  -c k_quants.c -o k_quants.o
nvcc --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native " -c ggml-cuda.cu -o ggml-cuda.o
/bin/sh: 1: nvcc: not found
make: *** [Makefile:411: ggml-cuda.o] Error 127

The relevant log:

/bin/sh: 1: nvcc: not found
make: *** [Makefile:411: ggml-cuda.o] Error 127

Is there any way we can get around nvcc not being present? For example, using a different tool like nvidia-smi as a stand-in for acquiring the information nvcc provides?
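For reference, a recent driver can report the GPU's compute capability directly, which is the piece of information -arch=native otherwise derives via nvcc. A minimal sketch (assuming the driver is new enough to support the compute_cap query field):

> nvidia-smi --query-gpu=compute_cap --format=csv,noheader

On a Tesla T4 this should print 7.5, i.e. the equivalent of -arch=sm_75.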

staviq (Collaborator) commented Oct 4, 2023

nvcc doesn't acquire any info; it is the compiler responsible for building that part of the code.

I might be wrong, but doesn't nvidia-cuda-toolkit already provide everything necessary to compile and run CUDA? Isn't installing cuda separately redundant here?

Even if there are some system package shenanigans, you can simply install nvidia-cuda-toolkit, build the code, and uninstall it.
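A minimal sketch of that temporary-install workflow (assuming Ubuntu package names; the runtime libraries pulled in as dependencies stay behind unless autoremoved, and the built binary still needs those at run time):

> sudo apt install -y nvidia-cuda-toolkit   # provides nvcc
> make LLAMA_CUBLAS=1                       # build with cuBLAS acceleration
> sudo apt remove -y nvidia-cuda-toolkit    # drop the compiler afterwards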

jamesbraza (Author) commented Oct 4, 2023

Thanks @staviq, I appreciate the follow-up. To explain the situation more concisely:

  1. cuBLAS gets installed by cuda, as suggested in "How to install with GPU support via cuBLAS and CUDA" (abetlen/llama-cpp-python#250 (comment)); see the check sketched after this list
    • My machine is an AWS EC2 g4dn.4xlarge, so it has one GPU. However, from https://stackoverflow.com/a/53493479:
      • Before sudo apt install cuda: the command there reports "no such directory", so cuBLAS isn't present
      • After sudo apt install cuda: the cuBLAS files are found
    • I believe cuda is also installing CUDA drivers (e.g. libnvidia-compute-535)
  2. cuda and nvidia-cuda-toolkit currently can't be resolved together by apt: https://forums.developer.nvidia.com/t/impossible-resolution-ubuntu-22-04/
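One way to check which package actually ships the cuBLAS header (a sketch; dpkg -S searches only installed packages, and apt-file must be installed and updated separately):

> dpkg -S cublas.h          # which installed package owns the header
> apt-file search cublas.h  # which available packages would provide it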

I think I need cuda for LLAMA_CUBLAS=1, but I also need nvidia-cuda-toolkit for the Llama.cpp build.

Even if there are some system package shenanigans, you can simply install nvidia-cuda-toolkit, build the code, and uninstall it.

Point 2 above is why I can't just temporarily install nvidia-cuda-toolkit.

Do you know of another way to get cuBLAS configured on EC2? I observed that nvidia-cuda-toolkit doesn't install cuBLAS support.

staviq (Collaborator) commented Oct 4, 2023

The comment you linked specifically mentions the "cuda toolkit" and explains it's needed for nvcc, which in your case is exactly what nvidia-cuda-toolkit provides.

Do not install cuda, install nvidia-cuda-toolkit.
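That is, something like the following (a sketch; removing the cuda metapackage can leave its dependencies behind, hence the autoremove):

> sudo apt remove -y cuda
> sudo apt autoremove -y
> sudo apt install -y nvidia-cuda-toolkit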

jamesbraza (Author) commented Oct 4, 2023

Okay, so before sudo apt install nvidia-cuda-toolkit:

> cat /usr/local/cuda/include/cublas.h | grep CUBLAS
cat: /usr/local/cuda/include/cublas.h: No such file or directory

Then running my apt installs (there are a few others):

> sudo apt update
...
> sudo apt install -y python3.11 python3.11-dev python3.11-venv \
    build-essential cmake make gcc \
    docker-ce docker-ce-cli docker-buildx-plugin docker-compose-plugin \
    nvidia-container-toolkit nvidia-cuda-toolkit
...
> sudo reboot

Afterwards, we see the cuBLAS files still aren't at this path, confirming /usr/local/cuda is populated by cuda:

> cat /usr/local/cuda/include/cublas.h | grep CUBLAS
cat: /usr/local/cuda/include/cublas.h: No such file or directory
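(Note: Ubuntu's nvidia-cuda-toolkit lays its files out FHS-style rather than under /usr/local/cuda, so a more telling check is probably against /usr/include, which is where the CMake run below finds the toolkit:)

> find /usr/include -maxdepth 1 -name 'cublas*'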

Regardless, continuing onwards:

> make LLAMA_CUBLAS=1
I llama.cpp build info:
I UNAME_S:   Linux
I UNAME_P:   x86_64
I UNAME_M:   x86_64
I CFLAGS:    -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native
I CXXFLAGS:  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native
I NVCCFLAGS: --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native "
I LDFLAGS:   -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/x86_64-linux/lib
I CC:        cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:       g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

cc  -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native    -c ggml.c -o ggml.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c llama.cpp -o llama.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c common/common.cpp -o common.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c common/console.cpp -o console.o
g++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native  -c common/grammar-parser.cpp -o grammar-parser.o
cc -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c11   -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wshadow -Wstrict-prototypes -Wpointer-arith -Wmissing-prototypes -Werror=implicit-int -Werror=implicit-function-declaration -Wdouble-promotion -pthread -march=native -mtune=native  -c k_quants.c -o k_quants.o
nvcc --forward-unknown-to-host-compiler -use_fast_math -arch=native -DGGML_CUDA_DMMV_X=32 -DGGML_CUDA_MMV_Y=1 -DK_QUANTS_PER_ITERATION=2 -DGGML_CUDA_PEER_MAX_BATCH_SIZE=128 -I. -Icommon -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -DNDEBUG -DGGML_USE_K_QUANTS -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/x86_64-linux/include  -std=c++11 -fPIC -O3 -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread    -Wno-pedantic -Xcompiler "-Wno-array-bounds -Wno-format-truncation -Wextra-semi -march=native -mtune=native " -c ggml-cuda.cu -o ggml-cuda.o
nvcc fatal   : Value 'native' is not defined for option 'gpu-architecture'
make: *** [Makefile:411: ggml-cuda.o] Error 1
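(The Ubuntu-packaged nvcc is 11.5, and to my knowledge -arch=native was only introduced in CUDA 11.6, hence the failure. If the Makefile at this revision honors the CUDA_DOCKER_ARCH override used by the CUDA Dockerfiles, pinning the T4's architecture explicitly might sidestep it:)

> make LLAMA_CUBLAS=1 CUDA_DOCKER_ARCH=sm_75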

Now let's try cmake (inspired by ggerganov/whisper.cpp#876 (comment)):

> mkdir build
> cd build
> cmake .. -DLLAMA_CUBLAS=ON
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found CUDAToolkit: /usr/include (found version "11.5.119")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 11.5.119
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done
-- Generating done
-- Build files have been written to: /home/ubuntu/code/llama.cpp/build
> cmake --build . --config Release
...

Yay, so the cmake build worked!


EDIT: I made a mistake here; skip the remainder of this comment

Erroneous `llama-cpp-python` install

Now let's go back to installing llama-cpp-python[server]:

> CMAKE_ARGS="-DLLAMA_METAL=on" python -m pip install -r dev-requirements.txt \
    --progress-bar off --no-cache-dir

And I get an error:

Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for llama-cpp-python (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [29 lines of output]
      *** scikit-build-core 0.5.1 using CMake 3.22.1 (wheel)
      *** Configuring CMake...
      loading initial cache file /tmp/tmp9s7tyvil/build/CMakeInit.txt
      -- The C compiler identification is GNU 11.4.0
      -- The CXX compiler identification is GNU 11.4.0
      -- Detecting C compiler ABI info
      -- Detecting C compiler ABI info - done
      -- Check for working C compiler: /usr/bin/cc - skipped
      -- Detecting C compile features
      -- Detecting C compile features - done
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Found Git: /usr/bin/git (found version "2.34.1")
      -- Looking for pthread.h
      -- Looking for pthread.h - found
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
      -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
      -- Found Threads: TRUE
      CMake Error at vendor/llama.cpp/CMakeLists.txt:174 (find_library):
        Could not find FOUNDATION_LIBRARY using the following names: Foundation


      -- Configuring incomplete, errors occurred!
      See also "/tmp/tmp9s7tyvil/build/CMakeFiles/CMakeOutput.log".

      *** CMake configuration failed
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python

So now I am failing with "Could not find FOUNDATION_LIBRARY using the following names: Foundation". Any ideas there? Also, I think this is now officially a problem with llama-cpp-python (not Llama.cpp), so I can also migrate this issue there.

slaren (Collaborator) commented Oct 4, 2023

You are building for Metal instead of CUDA. Try CMAKE_ARGS="-DLLAMA_CUBLAS=ON" instead.

staviq (Collaborator) commented Oct 4, 2023

CMAKE_ARGS="-DLLAMA_METAL=on" python -m pip install -r dev-requirements.txt
--progress-bar off --no-cache-dir

Why are you specifying METAL? That's a Mac/OSX framework.

jamesbraza (Author)

Yep, that's a mix-up, my bad. Now pasting in the right command...

> CMAKE_ARGS="-DLLAMA_CUBLAS=on" python -m pip install -r dev-requirements.txt \
    --progress-bar off --no-cache-dir

Works fine now, wow. So the takeaway here is that one doesn't need the cuda apt package for cuBLAS acceleration; nvidia-cuda-toolkit alone works.

Is there something I can do to help document this or make it clear somewhere? Otherwise we can just close this out as resolved, with user error as the cause.
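For anyone landing here later, a recap of the working sequence on a fresh Ubuntu 22.04 g4dn instance (a sketch assembled from the steps above):

> sudo apt install -y nvidia-cuda-toolkit   # nvcc plus cuBLAS headers/libs; no cuda metapackage needed
> CMAKE_ARGS="-DLLAMA_CUBLAS=on" python -m pip install "llama-cpp-python[server]==0.2.11" \
    --progress-bar off --no-cache-dir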

@AmericanPresidentJimmyCarter

For everyone else looking for the compile-time flag to indicate the right nvcc location, it's LLAMA_CUDA_NVCC. This should really be in the README.

e.g.

LLAMA_CUDA_NVCC=/usr/local/cuda-12.4/bin/nvcc make LLAMA_CUDA=1 -j4
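For CMake-based builds (including llama-cpp-python), the standard CMake variable CMAKE_CUDA_COMPILER should serve the same purpose (a sketch; the path is illustrative):

> cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.4/bin/nvcc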
