Is 30-minute-per-token slowness normal on an Intel Xeon? #250
Comments
Nope, not really. I'm running on an older CPU without issues, but you might be short on RAM, if your screenshot was taken without a model running. |
You're right, I noticed strange behavior while using the template. The swap (usually 12 GB) is about 90% used, on top of the 4 GB of actual RAM, for 16 GB total. I think I need 16 GB of real RAM instead of 4 GB RAM + 12 GB swap to speed up performance. |
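A quick way to confirm whether swap is the bottleneck is to compare RAM and swap usage. A minimal sketch using the psutil package (an assumption for illustration, not something used in this thread):

```python
# Check whether the machine is swapping; heavy swap use with only 4 GB of RAM
# would explain very slow token generation. Requires: pip install psutil
import psutil

ram = psutil.virtual_memory()
swap = psutil.swap_memory()

print(f"RAM : {ram.used / 2**30:.1f} / {ram.total / 2**30:.1f} GiB used")
print(f"Swap: {swap.used / 2**30:.1f} / {swap.total / 2**30:.1f} GiB used")

if swap.percent > 50:
    print("Model weights are likely spilling into swap; expect very slow generation.")
```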
I have the same issue. I'm running Windows 10 on an Intel i7 @ 2.80GHz, 32 GB RAM, and an Nvidia GTX 1050. The 7B model is painfully slow: it uses less than 40% of the CPU and 4 GB of RAM (I have more than 20 GB free), and it doesn't use the GPU. |
Same here. I have an 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz. I will update if I manage to understand the issue. |
Sorry for the late feedback. I'm running on 64 and 128 GB machines. The only thing that gave my setup a speed boost was increasing the threads from 4 to 18. |
Memory is pretty straightforward (from the docs): 7B => ~4 GB. Those are optimistic estimates; add +1 GB each. @michaelwdombek how many cores/threads does your CPU support? Did you see a performance increase setting the threads above your CPU's thread count? |
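As a back-of-envelope check on where the ~4 GB figure for 7B comes from (assuming 4-bit quantized ggml weights, which is what these setups use; the bits-per-weight figure is an approximation that includes per-block scale factors):

```python
# Rough estimate: 7B parameters at ~5 bits per weight (4-bit quantization plus
# quantization scales), plus runtime overhead for the context and buffers.
params = 7e9
bits_per_param = 5
weights_gb = params * bits_per_param / 8 / 2**30
overhead_gb = 1.0                      # the "+1 GB" mentioned above: KV cache, buffers, etc.
print(f"weights ~{weights_gb:.1f} GB, total ~{weights_gb + overhead_gb:.1f} GB")
# -> weights ~4.1 GB, total ~5.1 GB
```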
Never actually used more threads than I have, and I just tried it. Turns out it gets really, really slow if I oversubscribe, so I would recommend against that. |
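For reference, a minimal sketch of capping the thread count at what the CPU exposes (the binary path, model file, and prompt are placeholders for your own setup; `-t` is llama.cpp's thread option):

```python
import os

# Oversubscribing (more threads than the CPU exposes) made generation much slower,
# so cap the llama.cpp thread count at the logical CPU count.
threads = os.cpu_count() or 4

# Placeholder paths and prompt; adjust for your own build and model.
cmd = f'./main -m models/7B/ggml-model-q4_0.bin -t {threads} -p "Why is the sky blue?"'
print(cmd)
```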
I saw a post on another forum saying there was slowness with Docker. Are you running with or without Docker? |
I'm running the Docker Compose setup provided here, so no, it does not look like a Docker problem. I'll set up a test server on my oldest Xeon to check the speed and post an update later today. |
I'll be waiting for your results. I'm also installing another application on top of llama.cpp to test performance outside of Docker. I intend to report back today too. |
Ok, as promised I set Dalai up using Docker on my server:
- 1x Intel(R) Xeon(R) CPU E5-4620 v4 @ 2.10GHz
- 64 GB RAM
- Ubuntu 22.04 with kernel 5.15.0-69
- nothing tuned, just plain installed Ubuntu and Docker
I checked the times and get roughly 2-3 tokens per second of output with LLaMA 13B, but this includes loading the model into RAM and such. |
Do you use the same definition of token as OpenAI, i.e. roughly 1,000 tokens ~ 750 words? Can you try setting the threads to 8 or 4? Just out of curiosity, to understand the real impact of the number of threads.
|
Hey, so I did some more experimenting @dgasparri :)
Input and output tokens are counted with OpenAI's tokenizer API, and the time includes loading of the model. |
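For anyone who wants to reproduce that kind of measurement, a minimal sketch along these lines should work (this is not the exact script used above; the tiktoken encoding choice and the `generate()` stub are placeholders for however you invoke the model):

```python
import time
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")  # assumption: any OpenAI encoding is fine for a rough count

def generate(prompt: str) -> str:
    # Placeholder: replace with however you call llama.cpp / Dalai and return the generated text.
    return "The sky is blue because of Rayleigh scattering."

prompt = "Why is the sky blue?"
start = time.time()
output = generate(prompt)
elapsed = time.time() - start

n_tokens = len(enc.encode(output))
print(f"{n_tokens} output tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.2f} tokens/s")
```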
I installed it outside of Docker and did not get different results from those already mentioned above. What could it be? I've seen people running Alpaca 7B where it loads and responds in seconds, while on my machine, which is a powerful server, even the 7B model is extremely slow, even when using 6 cores. (I switched from the Intel Xeon to an AMD EPYC, which reduced the response time to 8 minutes and the load time to 18 minutes.) |
Dalai loads faster than Serge. :D |
@voarsh2 could you provide your evidence? I tested it and the results were apparently the opposite (Serge was faster). |
Some performance improvements were merged into llama.cpp two days ago, here. Are they automatically reflected in this project? |
I think Dalai doesn't like Xeons; I have the exact same problem... |
It's not specific to Intel Xeon. I moved the server to a VPS with an AMD EPYC (6 cores, 16 GB RAM) and it sped up only slightly; it made the model runnable, but it is still far too slow to be usable. From what I could analyze, the slowness comes from the llama.cpp repository: at some point a version that was meant to optimize the algorithm actually introduced intermittent performance problems. On some Windows machines the model does very well, responding almost instantly compared to my Linux servers. There is an intense discussion in their issues trying to uncover why the update that was supposed to boost model performance failed to do so. Please don't blame the Dalai developers: Dalai uses the llama.cpp library under the covers, as does the Serge chat, which shows the same slowness. They all use llama.cpp, and the real problem is somewhere in that library's code. Complain to the developers of llama.cpp or develop your own lib. |
I'm running Alpaca on my Intel Xeon Silver 4214 (4) @ 2.194GHz and it is extremely slow at generating a response: it takes between 20 and 40 minutes per word. Is this normal? I've seen many people running it on even lower-spec setups with faster results. I wonder if there's anything wrong with my server.