cuBLAS with llama-cpp-python on Windows #117

Closed
Priestru opened this issue Apr 26, 2023 · 11 comments
Labels
bug (Something isn't working) · build · oobabooga (https://github.com/oobabooga/text-generation-webui) · windows (A Windoze-specific issue)

Comments

Priestru commented Apr 26, 2023

cuBLAS with llama-cpp-python on Windows.

Well, it works on WSL for me as intended, but no trick of mine has made it work with llama.dll on Windows.
I've been trying daily for the last week, changing one thing or another. I asked a friend to try it on a different system, but he had no success either.

At this point I sincerely wonder whether anyone has ever made this work.
I don't expect any help with this issue, but if anyone could confirm that it does work on Windows and is possible at all as a concept, it would surely encourage me to continue this struggle.

(There is no issue getting the original llama.cpp itself to work with cuBLAS; for me the problem lies with the wrapper.)

I'll post the error just in case:

File E:\LLaMA\llama-cpp-python\llama_cpp\llama_cpp.py:54, in _load_shared_library(lib_base_name)
     53 try:
---> 54     return ctypes.CDLL(str(_lib_path))
     55 except Exception as e:

File ~\AppData\Local\Programs\Python\Python310\lib\ctypes\__init__.py:374, in CDLL.__init__(self, name, mode, handle, use_errno, use_last_error, winmode)
    373 if handle is None:
--> 374     self._handle = _dlopen(self._name, mode)
    375 else:

FileNotFoundError: Could not find module 'E:\LLaMA\llama-cpp-python\llama_cpp\llama.dll' (or one of its dependencies). Try using the full path with constructor syntax.

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Cell In[1], line 1
----> 1 from llama_cpp import Llama
      2 llm = Llama(model_path="./models/7B/ggml-model.bin")
      3 output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)

File E:\LLaMA\llama-cpp-python\llama_cpp\__init__.py:1
----> 1 from .llama_cpp import *
      2 from .llama import *

File E:\LLaMA\llama-cpp-python\llama_cpp\llama_cpp.py:67
     64 _lib_base_name = "llama"
     66 # Load the library
---> 67 _lib = _load_shared_library(_lib_base_name)
     69 # C types
     70 llama_context_p = c_void_p

File E:\LLaMA\llama-cpp-python\llama_cpp\llama_cpp.py:56, in _load_shared_library(lib_base_name)
     54             return ctypes.CDLL(str(_lib_path))
     55         except Exception as e:
---> 56             raise RuntimeError(f"Failed to load shared library '{_lib_path}': {e}")
     58 raise FileNotFoundError(
     59     f"Shared library with base name '{lib_base_name}' not found"
     60 )

RuntimeError: Failed to load shared library 'E:\LLaMA\llama-cpp-python\llama_cpp\llama.dll': Could not find module 'E:\LLaMA\llama-cpp-python\llama_cpp\llama.dll' (or one of its dependencies). Try using the full path with constructor syntax.

And here is the llama.dll that doesn't work:
llama.zip
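
The FileNotFoundError above means that either llama.dll itself or one of the DLLs it links against (typically the CUDA runtime and cuBLAS DLLs) could not be resolved. A minimal sketch of how one might narrow this down, assuming the CUDA Toolkit is installed under its default path; the version directory below is illustrative:

import ctypes
import os
from pathlib import Path

# Hypothetical CUDA install location -- adjust to the toolkit version actually installed.
cuda_bin = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1\bin")
lib_path = Path(r"E:\LLaMA\llama-cpp-python\llama_cpp\llama.dll")

# Since Python 3.8, Windows no longer searches PATH for dependent DLLs,
# so the CUDA bin directory has to be registered explicitly.
if cuda_bin.is_dir():
    os.add_dll_directory(str(cuda_bin))

try:
    ctypes.CDLL(str(lib_path))
    print("llama.dll loaded successfully")
except OSError as exc:
    # If it still fails, a dependency walker such as lucasg/Dependencies
    # can show exactly which DLL is missing.
    print(f"Failed to load: {exc}")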

@shengkaixuan

How did you get cuBLAS working with llama-cpp-python on Linux?
Looking forward to your reply!

@Priestru (Author)

How did you get cuBLAS working with llama-cpp-python on Linux? Looking forward to your reply!

My method is stupid af.

Key thingies:
Run export PATH=/usr/local/cuda/bin:$PATH (possibly for no reason).
Then go to ./vendor/llama.cpp and edit the Makefile:

delete line 107
ifdef LLAMA_CUBLAS
and line 116
endif

so the cuBLAS block is compiled unconditionally. We are barbarians now; the flag can't go wrong if we leave it no choice, right?

Then I go to CMakeLists.txt and change line 69 to

option(LLAMA_CUBLAS "llama: use cuBLAS" ON)

After that I check whether ./vendor/llama.cpp already contains a libllama.so, and delete it if it does, so a stale CPU-only build can't be reused. Now we can go back to llama-cpp-python and try to build it:

export LLAMA_CUBLAS=1
LLAMA_CUBLAS=1 python3 setup.py develop

This way I end up setting LLAMA_CUBLAS=1 about four times in different places. It still always takes me multiple attempts before the build finally succeeds, but it does eventually. As you can see, the way I do it isn't straightforward; at this point I don't know which step actually matters, so I just repeat all of them.
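
For reference, here is a rough sketch that collects those steps into one script, assuming the layout described above (the llama.cpp submodule under ./vendor/llama.cpp; the repository path below is hypothetical). It only automates what the comment does by hand and is not an official build recipe:

import os
import subprocess
from pathlib import Path

# Hypothetical checkout location -- adjust to wherever llama-cpp-python lives.
repo = Path("/mnt/e/LLaMA/llama-cpp-python")
stale = repo / "vendor" / "llama.cpp" / "libllama.so"

# Remove a stale CPU-only library so the next build cannot silently reuse it.
if stale.exists():
    stale.unlink()

# Request cuBLAS and make sure nvcc is reachable, then rebuild in-place.
env = os.environ.copy()
env["LLAMA_CUBLAS"] = "1"
env["PATH"] = "/usr/local/cuda/bin:" + env.get("PATH", "")

subprocess.run(["python3", "setup.py", "develop"], cwd=repo, env=env, check=True)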

@gjmulder (Contributor)

Can you run nvidia-smi when calling llama-cpp-python and observe GPU%, please?
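
For example, something like the following could sample utilization once per second while a prompt is being evaluated (a rough sketch; it only assumes nvidia-smi is on PATH):

import subprocess
import time

# Poll GPU utilization once per second for 30 seconds while llama-cpp-python runs.
for _ in range(30):
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    print(f"GPU utilization: {out.stdout.strip()}%")
    time.sleep(1)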

Priestru (Author) commented Apr 29, 2023

Can you run nvidia-smi when calling llama-cpp-python and observe GPU%, please?

If you're asking about the Windows one, there is nothing CUDA-related in it; I can't build a working DLL with CUDA.
On the other hand, WSL works with CUDA just fine for me (unless it's the latest release of llama.cpp, which causes CUDA OOM).

@gjmulder (Contributor)

If you're asking about the Windows one, there is nothing CUDA-related in it; I can't build a working DLL with CUDA.
On the other hand, WSL works with CUDA just fine for me (unless it's the latest release of llama.cpp, which causes CUDA OOM).

I seem to have progressed further than you w/Ubuntu?

@Priestru (Author)

If you're asking about the Windows one, there is nothing CUDA-related in it; I can't build a working DLL with CUDA.
On the other hand, WSL works with CUDA just fine for me (unless it's the latest release of llama.cpp, which causes CUDA OOM).

I seem to have progressed further than you w/Ubuntu?

Well, this one: ggerganov/llama.cpp#1207

doesn't work for me in WSL Ubuntu; I opened an issue about it here: ggerganov/llama.cpp#1230

But the previous way of allocating memory on the GPU worked for me just fine and still works.
On Windows, I still haven't managed to build a working llama.dll with cuBLAS in it even once.

@gjmulder (Contributor)

I have a working dynamic Linux binary. Frustratingly, however, it doesn't generate any GPU%, regardless of the batch size.

@Priestru (Author)

I have a working dynamic Linux binary. Frustratingly, however, it doesn't generate any GPU%, regardless of the batch size.

You mean libllama.so?

@gjmulder (Contributor)

I have a working dynamic Linux binary. Frustratingly, however, it doesn't generate any GPU%, regardless of the batch size.

You mean libllama.so?

Yes, in that it seems to load fine from within Python 3.10.10.
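
One way to check whether the loaded library was actually compiled with a BLAS backend is to print llama.cpp's system-info string from Python and look for BLAS = 1 (the same line that appears in the loading log further down). A sketch, assuming the binding exposes llama_print_system_info() as a thin wrapper over the C function of the same name:

from llama_cpp import llama_cpp

# llama_print_system_info() mirrors the C API and returns the compiled-in
# feature string (AVX, FMA, BLAS, ...) as bytes.
info = llama_cpp.llama_print_system_info().decode("utf-8")
print(info)

if "BLAS = 1" in info:
    print("A BLAS backend (e.g. cuBLAS) was compiled in.")
else:
    print("CPU-only build: no BLAS backend.")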

Priestru (Author) commented Apr 29, 2023

I have a working dynamic Linux binary. Frustratingly, however, it doesn't generate any GPU%, regardless of the batch size.

You mean libllama.so?

Yes, in that it seems to load fine from within Python 3.10.10.

Well, that one works for me too. It is basically the very reason I use WSL at all: it gives me a working libllama.so with cuBLAS enabled.

I don't use llama-cpp-python standalone on Ubuntu; I call it through the oobabooga webui, which I then use purely as an API.

I build llama-cpp-python's CUDA-enabled libllama.so, feed it to ooba, and then launch it like this:

(textgen) root@DESKTOP-61FF5OF:/mnt/e/LLaMA/Ubuntu/text-generation-webui# python3 server.py --listen --model-dir /media/ --chat --verbose --api --model ./ggml-vic13b-q5_0.bin --n_batch 512 --threads 12
Gradio HTTP request redirected to localhost :)
bin /root/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda121.so
Loading ./ggml-vic13b-q5_0.bin...
llama.cpp weights detected: /media/ggml-vic13b-q5_0.bin

llama.cpp: loading model from /media/ggml-vic13b-q5_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 8 (mostly Q5_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 10583.25 MB (+ 1608.00 MB per state)
llama_init_from_file: kv self size  = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
Starting streaming server at ws://0.0.0.0:5005/api/v1/stream
Loading the extension "gallery"... Starting API at http://0.0.0.0:5000/api
Ok.
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.

Then I return to Windows, where I run https://github.com/Cohee1207/SillyTavern and https://github.com/Cohee1207/TavernAI-extras, which connect to WSL's API, and it all comes together.

And yeah, CUDA works, because I can see and feel it:
[image: GPU usage]

The problem for me, and the reason this issue is still open, is that I can't build a working llama.dll on Windows so that I can drop WSL from my setup.
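
For anyone reproducing this split setup, the Windows side only needs HTTP access to the WSL port. A rough sketch of a blocking request against the webui's API as it existed at the time, assuming the --api extension is listening on port 5000 as in the log above (the endpoint path and payload fields may differ between webui versions):

import json
import urllib.request

# Hypothetical request against the webui's legacy blocking endpoint.
payload = {
    "prompt": "Q: Name the planets in the solar system. A: ",
    "max_new_tokens": 64,
}
req = urllib.request.Request(
    "http://localhost:5000/api/v1/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())

# The legacy API returned generations under "results"; adjust if the schema differs.
print(result["results"][0]["text"])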

@gjmulder (Contributor)

Can you try the script in #182?

@gjmulder gjmulder added the windows A Windoze-specific issue label May 18, 2023
@gjmulder gjmulder added the oobabooga https://github.com/oobabooga/text-generation-webui label Jun 9, 2023