cuBLAS with llama-cpp-python on Windows #117

Closed
Priestru opened this issue Apr 26, 2023 · 11 comments
Labels
bug (Something isn't working) · build · oobabooga (https://github.com/oobabooga/text-generation-webui) · windows (A Windoze-specific issue)

Comments

Priestru commented Apr 26, 2023

cuBLAS with llama-cpp-python on Windows.

Well, it works on WSL for me as intended, but no trick of mine has made it work with llama.dll on Windows.
I've been trying daily for the last week, changing one thing or another. I asked a friend to try it on a different system, but he had no success either.

At this point I sincerely wonder whether anyone has ever made this work.
I don't expect any help with this issue, but if anyone could confirm that it does work on Windows and is possible at all as a concept, it would surely encourage me to continue this struggle.

(There is no issue getting the original llama.cpp itself to work with cuBLAS; for me the problem lies with the wrapper.)

I'll post the error just in case:

File E:\LLaMA\llama-cpp-python\llama_cpp\llama_cpp.py:54, in _load_shared_library(lib_base_name)
     53 try:
---> 54     return ctypes.CDLL(str(_lib_path))
     55 except Exception as e:

File ~\AppData\Local\Programs\Python\Python310\lib\ctypes\__init__.py:374, in CDLL.__init__(self, name, mode, handle, use_errno, use_last_error, winmode)
    373 if handle is None:
--> 374     self._handle = _dlopen(self._name, mode)
    375 else:

FileNotFoundError: Could not find module 'E:\LLaMA\llama-cpp-python\llama_cpp\llama.dll' (or one of its dependencies). Try using the full path with constructor syntax.

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Cell In[1], line 1
----> 1 from llama_cpp import Llama
      2 llm = Llama(model_path="./models/7B/ggml-model.bin")
      3 output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)

File E:\LLaMA\llama-cpp-python\llama_cpp\__init__.py:1
----> 1 from .llama_cpp import *
      2 from .llama import *

File E:\LLaMA\llama-cpp-python\llama_cpp\llama_cpp.py:67
     64 _lib_base_name = "llama"
     66 # Load the library
---> 67 _lib = _load_shared_library(_lib_base_name)
     69 # C types
     70 llama_context_p = c_void_p

File E:\LLaMA\llama-cpp-python\llama_cpp\llama_cpp.py:56, in _load_shared_library(lib_base_name)
     54             return ctypes.CDLL(str(_lib_path))
     55         except Exception as e:
---> 56             raise RuntimeError(f"Failed to load shared library '{_lib_path}': {e}")
     58 raise FileNotFoundError(
     59     f"Shared library with base name '{lib_base_name}' not found"
     60 )

RuntimeError: Failed to load shared library 'E:\LLaMA\llama-cpp-python\llama_cpp\llama.dll': Could not find module 'E:\LLaMA\llama-cpp-python\llama_cpp\llama.dll' (or one of its dependencies). Try using the full path with constructor syntax.

And here is the llama.dll that doesn't work:
llama.zip
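
The FileNotFoundError above means that either llama.dll itself or one of the DLLs it links against (typically the CUDA runtime and cuBLAS DLLs) could not be resolved. A minimal sketch of how one might narrow this down, assuming the CUDA Toolkit is installed under its default path; the version directory below is illustrative:

import ctypes
import os
from pathlib import Path

# Hypothetical CUDA install location -- adjust to the toolkit version actually installed.
cuda_bin = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin")
lib_path = Path(r"E:\LLaMA\llama-cpp-python\llama_cpp\llama.dll")

# Since Python 3.8, Windows no longer searches PATH for dependent DLLs,
# so the CUDA bin directory has to be registered explicitly.
if cuda_bin.is_dir():
    os.add_dll_directory(str(cuda_bin))

try:
    ctypes.CDLL(str(lib_path))
    print("llama.dll loaded successfully")
except OSError as exc:
    # If it still fails, a dependency walker such as lucasg/Dependencies
    # can show exactly which DLL is missing.
    print(f"Failed to load: {exc}")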

@shengkaixuan

How did you get cuBLAS working with llama-cpp-python on Linux?
Looking forward to your reply!

@Priestru (Author)

How did you get cuBLAS working with llama-cpp-python on Linux? Looking forward to your reply!

My method is stupid af.

Key thingies:
Run export PATH=/usr/local/cuda/bin:$PATH (possibly for no reason).
Then go to ./vendor/llama.cpp and edit the Makefile:

delete line 107
ifdef LLAMA_CUBLAS
and line 116
endif

so the cuBLAS block is compiled unconditionally. We are barbarians now; the flag can't go wrong if we leave it no choice, right?

Then I go to CMakeLists.txt and change line 69 to

option(LLAMA_CUBLAS "llama: use cuBLAS" ON)

After that I check whether ./vendor/llama.cpp already contains a libllama.so, and delete it if it does, so a stale CPU-only build can't be reused. Now we can go back to llama-cpp-python and try to build it:

export LLAMA_CUBLAS=1
LLAMA_CUBLAS=1 python3 setup.py develop

This way I end up setting LLAMA_CUBLAS=1 about four times in different places. It still always takes me multiple attempts before the build finally succeeds, but it does eventually. As you can see, the way I do it isn't straightforward; at this point I don't know which step actually matters, so I just repeat all of them.
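
For reference, here is a rough sketch that collects those steps into one script, assuming the layout described above (the llama.cpp submodule under ./vendor/llama.cpp; the repository path below is hypothetical). It only automates what the comment does by hand and is not an official build recipe:

import os
import subprocess
from pathlib import Path

# Hypothetical checkout location -- adjust to wherever llama-cpp-python lives.
repo = Path("/mnt/e/LLaMA/llama-cpp-python")
stale = repo / "vendor" / "llama.cpp" / "libllama.so"

# Remove a stale CPU-only library so the next build cannot silently reuse it.
if stale.exists():
    stale.unlink()

# Request cuBLAS and make sure nvcc is reachable, then rebuild in-place.
env = os.environ.copy()
env["LLAMA_CUBLAS"] = "1"
env["PATH"] = "/usr/local/cuda/bin:" + env.get("PATH", "")

subprocess.run(["python3", "setup.py", "develop"], cwd=repo, env=env, check=True)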

@gjmulder (Contributor)

Can you run nvidia-smi when calling llama-cpp-python and observe GPU%, please?
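
For example, something like the following could sample utilization once per second while a prompt is being evaluated (a rough sketch; it only assumes nvidia-smi is on PATH):

import subprocess
import time

# Poll GPU utilization once per second for 30 seconds while llama-cpp-python runs.
for _ in range(30):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    print(f"GPU utilization: {out.stdout.strip()}%")
    time.sleep(1)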

Priestru (Author) commented Apr 29, 2023

Can you run nvidia-smi when calling llama-cpp-python and observe GPU%, please?

If you're asking about the Windows one, there is nothing CUDA-related in it; I can't build a working DLL with CUDA.
On the other hand, WSL works with CUDA just fine for me (unless it's the latest release of llama.cpp, which causes CUDA OOM).

@gjmulder (Contributor)

If you're asking about the Windows one, there is nothing CUDA-related in it; I can't build a working DLL with CUDA.
On the other hand, WSL works with CUDA just fine for me (unless it's the latest release of llama.cpp, which causes CUDA OOM).

I seem to have progressed further than you w/Ubuntu?

@Priestru (Author)

If you're asking about the Windows one, there is nothing CUDA-related in it; I can't build a working DLL with CUDA.
On the other hand, WSL works with CUDA just fine for me (unless it's the latest release of llama.cpp, which causes CUDA OOM).

I seem to have progressed further than you w/Ubuntu?

Well, this one: ggerganov/llama.cpp#1207

doesn't work for me in WSL Ubuntu; I opened an issue about it here: ggerganov/llama.cpp#1230

But the previous way of allocating memory on the GPU worked for me just fine and still works.
On Windows, I still haven't managed to build a working llama.dll with cuBLAS in it even once.

@gjmulder (Contributor)

I have a working dynamic Linux binary. Frustratingly, however, it doesn't generate any GPU%, regardless of the batch size.

@Priestru (Author)

I have a working dynamic Linux binary. Frustratingly, however, it doesn't generate any GPU%, regardless of the batch size.

You mean libllama.so?

@gjmulder (Contributor)

I have a working dynamic Linux binary. Frustratingly, however, it doesn't generate any GPU%, regardless of the batch size.

You mean libllama.so?

Yes, in that it seems to load fine from within Python 3.10.10.
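
One way to check whether the loaded library was actually compiled with a BLAS backend is to print llama.cpp's system-info string from Python and look for BLAS = 1 (the same line that appears in the loading log further down). A sketch, assuming the binding exposes llama_print_system_info() as a thin wrapper over the C function of the same name:

from llama_cpp import llama_cpp

# llama_print_system_info() mirrors the C API and returns the compiled-in
# feature string (AVX, FMA, BLAS, ...) as bytes.
info = llama_cpp.llama_print_system_info().decode("utf-8")
print(info)

if "BLAS = 1" in info:
    print("A BLAS backend (e.g. cuBLAS) was compiled in.")
else:
    print("CPU-only build: no BLAS backend.")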

Priestru (Author) commented Apr 29, 2023

I have a working dynamic Linux binary. Frustratingly, however, it doesn't generate any GPU%, regardless of the batch size.

You mean libllama.so?

Yes, in that it seems to load fine from within Python 3.10.10.

Well, that one works for me too. It is basically the very reason I use WSL at all: it gives me a working libllama.so with cuBLAS enabled.

I don't use llama-cpp-python standalone on Ubuntu; I call it through the oobabooga webui, which I then use purely as an API.

I build llama-cpp-python's CUDA-enabled libllama.so, feed it to ooba, and then launch it like this:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/text-generation-webui# python3 server.py --listen --model-dir /media/ --chat --verbose --api --model ./ggml-vic13b-q5_0.bin --n_batch 512 --threads 12
Gradio HTTP request redirected to localhost :)
bin /root/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
Loading ./ggml-vic13b-q5_0.bin...
llama.cpp weights detected: /media/ggml-vic13b-q5_0.bin

llama.cpp: loading model from /media/ggml-vic13b-q5_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 10583.25 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Starting streaming server at ws://0.0.0.0:5005/api/v1/stream
Loading the extension "gallery"... Starting API at http://0.0.0.0:5000/api
Ok.
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.

Then I return to Windows, where I run https://github.com/Cohee1207/SillyTavern and https://github.com/Cohee1207/TavernAI-extras, which connect to WSL's API, and it all comes together.

And yeah, CUDA works, because I can see and feel it:
[image: GPU usage]

The problem for me, and the reason this issue is still open, is that I can't build a working llama.dll on Windows so that I can drop WSL from my setup.
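
For anyone reproducing this split setup, the Windows side only needs HTTP access to the WSL port. A rough sketch of a blocking request against the webui's API as it existed at the time, assuming the --api extension is listening on port 5000 as in the log above (the endpoint path and payload fields may differ between webui versions):

import json
import urllib.request

# Hypothetical request against the webui's legacy blocking endpoint.
payload = {
    "prompt": "Q: Name the planets in the solar system. A: ",
    "max_new_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:5000/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

# The legacy API returned generations under "results"; adjust if the schema differs.
print(result["results"][0]["text"])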

@gjmulder (Contributor)

Can you try the script in #182?

@gjmulder gjmulder added the windows A Windoze-specific issue label May 18, 2023
@gjmulder gjmulder added the oobabooga https://github.com/oobabooga/text-generation-webui label Jun 9, 2023