Incredibly slow response time #49
Comments
Hi @MartinPJB, it looks like the package was built with the correct optimizations, could you pass

Also, if possible, can you try building the regular llama.cpp?
Hi @abetlen, I did pass verbose=True when instantiating the Llama class: llm = Llama(model_path="./model/ggml-model-q4_0_new.bin", verbose=True). I assume that is what you meant. However, it is not printing any details in the console after I submit my prompt. Since you said the package was built with the correct optimizations, maybe the problem is coming from my computer itself, but I have no idea why it is so slow. By the way, I am not familiar with building the regular llama.cpp.
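For anyone landing here, the call described above looks roughly like this (the prompt and max_tokens below are placeholders, not from the original report):

```python
from llama_cpp import Llama

# verbose=True makes llama.cpp print its load and timing info to the console.
llm = Llama(model_path="./model/ggml-model-q4_0_new.bin", verbose=True)

# Placeholder prompt; with verbose output enabled, per-call timings should be
# printed once the completion finishes.
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32)
print(output["choices"][0]["text"])
```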
If you follow these instructions, you should be able to test it out: https://github.com/ggerganov/llama.cpp#usage
No worries. I am going to do what's needed and share the results with you in a bit / tomorrow, if that is good for you.
For what it's worth, I am also getting much slower generations compared to interacting with llama.cpp natively in the terminal.
@abetlen I did some testing on my M1 Mac, and here are my results (llama.cpp is faster). Stats for calling
Llama.cpp
Here are the model loads (looks the same to me): llama-cpp-python
llama.cpp
Thoughts? Am I doing this right? I just added
What's the difference between prompt_eval time and eval time? Seems like llama.cpp is higher on eval time for some reason 🤷
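For what it's worth, prompt eval time is the time spent ingesting the prompt tokens, while eval time is the per-token cost of generating new tokens afterwards. A rough way to see the two phases from Python (the model path and prompt are placeholders):

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")  # placeholder path

prompt = "Q: Name the planets in the solar system? A: "
start = time.perf_counter()
first_token = None
n_chunks = 0

# Streaming separates time-to-first-token (dominated by prompt processing)
# from the steady per-token generation rate.
for chunk in llm(prompt, max_tokens=64, stream=True):
    if first_token is None:
        first_token = time.perf_counter()
    n_chunks += 1

end = time.perf_counter()
if first_token is not None:
    print(f"time to first token: {first_token - start:.2f}s")
    if n_chunks > 1:
        print(f"avg per-token time after that: {(end - first_token) / (n_chunks - 1):.3f}s")
```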
Ran a couple more runs for
For some reason, it seems like no matter what model I use, the response time stays around the same (between 18 and 25 minutes). I did a series of tests tonight and all of them returned around the same execution time. I also did some tests using llama.cpp, and it doesn't seem to change either: llama.cpp takes a long time to respond on my computer as well. Do any of you have an idea of what I am doing wrong? Thanks.
Update:
With LLaMA.cpp:
With llama-cpp-python:
It is slower by orders of magnitude.
@acheong08 I see, although it shouldn't take around 18 minutes to generate an output, should it?
It shouldn't. Make sure you have the latest version of llama.cpp installed and
I see, I'll try that and I'll keep you updated. By the way, is there any way to check how much memory is used by the binding during its execution?
System monitor / task manager, and look for Python processes.
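If you want a number from inside the script rather than the task manager, something like this works (psutil is an extra dependency, not part of the binding):

```python
import os
import psutil  # pip install psutil

proc = psutil.Process(os.getpid())
# Resident set size of the current Python process (includes the model once loaded).
print(f"RSS: {proc.memory_info().rss / 1024**2:.0f} MiB")
print(f"system memory in use: {psutil.virtual_memory().percent}%")
```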
I'm still not sure why this Python binding is so much slower than the standard llama.cpp, though.
@acheong08 I have one running right now; I launched it around 5 minutes ago. I'll keep you updated when it ends and how long it took.
Took 74 minutes this time. I really wonder what could be wrong.
Does llama.cpp even support GPU? I don't see any function to move things to different hardware, as you usually would in PyTorch etc.
No.
Hi all, I can also confirm there's a performance issue with this, but it looks to me like the problem is only in the high_level (FastAPI) variant. The regular llama.cpp works fine on our server: it responds to simple instructions in a matter of seconds (it's a small SOHO server with a 4-core i5 and 16GB of memory). I also tried the low_level variant of this Python binding and, interestingly enough, it works almost the same as llama.cpp (it starts generating words in around 5-6 seconds)!

When I finally tried the web server version (high_level variant) through the /docs API client, I got much worse performance: it usually takes around 25 seconds to start generating anything :( I started tinkering with the parameters and increased the number of threads from 2 to 4, which gave somewhat better results (the wait time is now around 16 seconds, more or less), but that is still about 2x slower than the llama.cpp variant. I also tried to enable mmap (I had to increase the limit using the ulimit command), but no luck, still the same (except the memory is now almost full and the swap starts eating the space). I even tried playing with other parameters, but it seems nothing can speed up this server, which is so weird since it's supposed to be a web service extension over a Python C++ wrapper (umm, right?).

I didn't have time to dive deeper into the code, but my naive guess is that something is fishy in the high_level code that makes generation slower than the low_level variant, though I have no idea what. Let me know, guys, if you need any other info; I'm very interested in getting this to work since we're using LLMs as microservices extensively. Cheers
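For the thread count specifically, the low-level binding exposes it on the constructor; a minimal sketch (the model path is a placeholder, and 4 matches the core count mentioned above):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-7b-ggml-q4_0/ggml-model-q4_0.bin",  # placeholder path
    n_threads=4,  # match the number of physical cores on the i5
)
print(llm("Write one sentence about servers.", max_tokens=32)["choices"][0]["text"])
```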
@bratislav I'll definitely look into this; can you provide some minimal examples? Also, there are some high-level API examples: do you mind confirming whether the issue is there or in the server? There may be some weird asyncio bugs with the server, so I'm just trying to figure out where this could be coming from. I also provided a small example in #51 (comment) showing how you can generate a flamegraph to profile where the majority of time is being spent; that might help us track down whether it's a Python issue or comes from different parameters / settings in llama.cpp.
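If py-spy isn't an option, even the standard-library profiler gives a first hint of whether the time is spent on the Python side or inside the native calls. This is a cProfile sketch, not the flamegraph approach mentioned above; the model path and prompt are placeholders:

```python
import cProfile
import pstats

from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model-q4_0.bin")  # placeholder path

with cProfile.Profile() as profiler:
    llm("Q: Name the planets in the solar system? A: ", max_tokens=32)

# Most of the time should show up under the native llama_cpp calls; anything
# large outside of them would point at Python-side overhead instead.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```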
Look at your memory and disk utilization. If your RAM is anywhere near 100%, you're going to have bad performance because the system is using swap to compensate for the lack of memory. If your disk utilization goes up as you run the executable, that's a sign that bad things are happening. If it takes minutes to compute a response, then something is definitely wrong with how things are running.
Hi @abetlen, thanks for the quick response. Here's our SOHO server info: Ubuntu 20.04.5 LTS

I downloaded the alpaca and llama models from the torrent found here, using the magnet link at the top of the page... not sure if the models are the right ones, but here are the files I'm using: llama-7b-ggml-q4_0/ggml-model-q4_0.bin (I haven't converted or modified them, I'm just running them).

Okay, so the first run is the regular (latest) llama.cpp, downloaded and built yesterday:
Timings: model loading (until the first input shows): ~6 seconds.

Second run: I try the low_level Python wrapper around the same llama.cpp version (downloaded into the /vendor dir), on the same machine:
Not sure if I messed up something with the params, but this example in interactive mode doesn't ask for any input; it just starts spitting out words and stops after some number of them. Also, if I change the high_level example from chat to llama, it just segfaults:
Timings of the chat example: model loading (until the first word shows): almost immediate.

Third run: I try the high_level_api example, as you requested:
Timings:

In a final run, I try llama-cpp-python from pip, firing instructions from the /docs web client (the v1/chat/completions endpoint with the default JSON request). The first instruction response shows after around the same time (25 seconds), and if I bump the threads up from 2 to 4 I get the response in around 12-16 seconds. Almost everything I've tried so far to speed this up further has failed. Also, when I build the Python wheel from source, the timings stay the same. Here's an example:
I can run all these examples on our production server, which is a 4-Xeon IBM beast with 128GB of memory, to see if there's any difference. TODO:
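For reproducing the /docs test from a script instead of the browser, a request along these lines should work (the host, port, and payload are assumptions based on the server defaults):

```python
import requests  # pip install requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed default host/port
    json={
        "messages": [{"role": "user", "content": "What is the capital of France?"}],
        "max_tokens": 64,
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```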
@abetlen Here are the outputs of py-spy (using 0.1.30). Low-level:
Here it is for the high_level API (took almost 25 seconds!):
The difference seems to be minimal now.
It seems that way to me as well. It now takes just a few seconds instead of the 18 minutes I had before. I don't know what happened, but I am glad it got fixed. Thanks!
It might have just been a regression in the base llama.cpp library, but who knows.
@abetlen @MartinPJB @acheong08 Guys, I've updated both llama.cpp and llama-cpp-python and I'm still getting the same timings as before, no change whatsoever :( I'm starting to think my model is wrong. Are you willing to share the link you downloaded the .bin files from (or at least a sha256 sum so I can compare with mine)? Did you create the *.bin file from the weights yourselves, or did you download an already-made GGML file? Are you using GGML v2, maybe? Thanks
@bratislav The issue was not with the bin files. I tested with the same model throughout. Specifically https://huggingface.co/Pi3141/gpt4-x-alpaca-native-13B-ggml
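If it helps to rule the files out, checksums are easy to compare from Python (the path below is a placeholder):

```python
import hashlib

def sha256sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large model files don't need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256sum("./models/llama-7b-ggml-q4_0/ggml-model-q4_0.bin"))  # placeholder path
```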
Make sure to
This fetches you the latest version.
@acheong08 @abetlen @MartinPJB Unsure if the slowness here was related, but there was a bunch of performance regression work in the base lib, which I was vaguely following, that was resolved recently:
Hello.
I am still new to llama-cpp and I was wondering whether it is normal for it to take an incredibly long time to respond to my prompt.
FYI, I am assuming it runs on my CPU; here are my specs:
Everything else seems to work fine; the model loads correctly (or at least it seems to).
I did a first test using the code showcased in the README.md:
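(The README snippet at the time was roughly the following; treat the model path and parameters as placeholders rather than the exact values used here.)

```python
from llama_cpp import Llama

llm = Llama(model_path="./models/7B/ggml-model.bin")  # placeholder path
output = llm(
    "Q: Name the planets in the solar system? A: ",
    max_tokens=32,
    stop=["Q:", "\n"],
    echo=True,
)
print(output)
```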
It returned this:
The output is what I expected (even though Uranus, Neptune, and Pluto were missing), but the total time is extremely long (1124707.08 ms, about 18 minutes). I tried this second piece of code to see what could be causing the insanely long response time, but I don't know what's going on.
I may have done things wrong since I am still new to all of this, but do any of you have any idea how I could speed up the process? I searched for solutions on Google, GitHub, and different forums, but nothing seems to work.
PS: For those interested in the CLI output when it loads the model:
I apologize in advance if my English doesn't make sense sometimes; it is not my native language.
Thanks in advance for the help, regards. 👋