
Is it normal to have 30-minute/token slowness on an Intel Xeon? #250

Open
alph4b3th opened this issue Mar 24, 2023 · 20 comments

Comments

@alph4b3th

(screenshot attached)
I'm running Alpaca on my Intel Xeon Silver 4214 (4) @ 2.194 GHz and generation is extremely slow: it takes between 20 and 40 minutes per word. Is this normal? I've seen many people running it on even lower-end setups with faster results, so I wonder if something is wrong with my server.

@michaelwdombek

Nope, not really. I'm running on an older CPU without issues, but you might be short on RAM, assuming your screenshot was taken without a model running.

@alph4b3th
Author

You're right, I noticed strange behavior while running the model. Swap (usually 12 GB) is at about 90% usage on top of the 4 GB of physical RAM, for 16 GB in total. I think I need 16 GB of actual RAM, instead of 4 GB RAM + 12 GB swap, to speed up performance.
Out of curiosity, how much RAM does your computer have?
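
For reference, a minimal sketch along these lines (assuming the third-party psutil package, which is not something dalai ships) can show whether a model will fit in physical RAM or spill into swap:

```python
# Minimal sketch: check whether physical RAM alone can hold a model,
# or whether it will page out to swap (which is what makes generation crawl).
# Assumes the third-party psutil package is installed.
import psutil

GIB = 1024 ** 3

ram = psutil.virtual_memory()
swap = psutil.swap_memory()

print(f"RAM : {ram.available / GIB:.1f} GiB free of {ram.total / GIB:.1f} GiB")
print(f"Swap: {swap.used / GIB:.1f} GiB used of {swap.total / GIB:.1f} GiB")

model_ram_gib = 4 + 1  # ~4 GiB for 7B (per the docs) plus ~1 GiB headroom
if ram.available / GIB < model_ram_gib:
    print("Model will not fit in physical RAM and will page to swap -> very slow.")
```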

@alph4b3th
Author

Hi, I have a VPS (setup attached) and I'm seeing slowness when running the 7B and 13B models. Besides taking about 3-8 minutes to load, the model generates at roughly 3200 ms/token. This looks like abnormal behavior and so far I don't know what is causing it. Can anyone help me?
(screenshot of VPS setup attached)
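
Put differently, 3200 ms/token is well under one token per second; a quick conversion (illustrative only):

```python
# Convert the measured per-token latency into tokens per second.
ms_per_token = 3200
tokens_per_second = 1000 / ms_per_token
print(f"{tokens_per_second:.2f} tokens/s")  # ~0.31 tokens/s
```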

@dgasparri

I have the same issue. I'm running Windows 10 on an Intel i7 @ 2.80 GHz with 32 GB of RAM and an Nvidia GTX 1050.

The 7B model is painfully slow to run: it uses less than 40% of the CPU and 4 GB of RAM (I have more than 20 GB free), and it doesn't use the GPU.

@RicRicci22

Same here. I have an 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40 GHz.
I entered the prompt and clicked Go, but after 15 minutes of nothing I decided to kill the process. I'm not sure what caused the bottleneck, but the utilization statistics were not going up; it seems the model was not even loaded.

I will update if I manage to understand the issue.

@michaelwdombek

Sorry for the late feedback; I'm running on 64 GB and 128 GB machines.

The only thing that gave my setup a speed boost was increasing the thread count from 4 to 18.

@dgasparri

Memory is pretty straightforward (from the docs):

7B => ~4 GB
13B => ~8 GB
30B => ~16 GB
65B => ~32 GB

Those are optimistic estimates; add about 1 GB to each.

@michaelwdombek how many cores/threads does your CPU support? Did you see a performance increase when setting the thread count above your CPU's thread count?
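
To make that rule of thumb concrete, here is a small illustrative sketch (the numbers come from the docs quoted above; the function name is made up for the example):

```python
# Approximate RAM needed per model size, per the docs quoted above,
# plus ~1 GiB of headroom as suggested.
MODEL_RAM_GIB = {"7B": 4, "13B": 8, "30B": 16, "65B": 32}

def required_ram_gib(model: str, headroom_gib: float = 1.0) -> float:
    """Return a rough amount of free RAM (GiB) to plan for."""
    return MODEL_RAM_GIB[model] + headroom_gib

for model in MODEL_RAM_GIB:
    print(f"{model}: plan for about {required_ram_gib(model):.0f} GiB of free RAM")
```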

@michaelwdombek

I had never actually used more threads than I have, so I just tried it. It turns out it gets really, really slow if I oversubscribe, so I would recommend against that.
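
As a minimal sketch of that "don't oversubscribe" rule (standard library only; os.cpu_count() reports logical CPUs, and on SMT machines the physical core count may be an even better ceiling):

```python
import os

def pick_thread_count(requested: int) -> int:
    # Clamp the requested thread count to the logical CPUs on this machine;
    # going past that was observed above to make generation much slower.
    logical_cpus = os.cpu_count() or 1
    return max(1, min(requested, logical_cpus))

print(pick_thread_count(18))  # never more threads than CPUs
```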

@alph4b3th
Author

I saw someone on another forum saying there was slowness with Docker. Are you running with or without Docker?

@michaelwdombek

I'm running the docker compose setup provided here, so no, it does not look like a Docker problem.

I'll set up a test server on my oldest Xeon to check the speed and post an update later today.

@alph4b3th
Author

Looking forward to it. I'm also installing another application on top of llama.cpp to test performance outside of Docker; I intend to report back today as well.

@michaelwdombek

OK, as promised, I set dalai up using Docker on my server:

  • 1x Intel(R) Xeon(R) CPU E5-4620 v4 @ 2.10GHz
  • 64 GB RAM
  • Ubuntu 22.04 with kernel 5.15.0-69
  • nothing tuned, just a plain Ubuntu and Docker install

I checked the times and get roughly 2-3 tokens per second of output with LLaMA 13B, but this includes loading the model into RAM and such.

@dgasparri

dgasparri commented Mar 29, 2023 via email

@michaelwdombek

Hey, so I did some more experimenting @dgasparri :)

| threads | input tokens | output tokens | output chars | time (incl. model loading) | tokens/s |
|---------|--------------|---------------|--------------|----------------------------|----------|
| 4       | 7            | 70            | 360          | 60.1 s                     | 1.16     |
| 8       | 7            | 128           | 708          | 59.8 s                     | 2.14     |
| 16      | 7            | 142           | 684          | 59.9 s                     | 2.37     |

Input and output tokens were counted with OpenAI's Tokenizer, and the times include loading the model.
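
For clarity, the tokens/s column is simply output tokens divided by the total wall time (model loading included); a quick check:

```python
# Recompute the tokens/s column from the table above:
# output tokens divided by total wall time, model loading included.
runs = [(4, 70, 60.1), (8, 128, 59.8), (16, 142, 59.9)]  # (threads, output tokens, seconds)
for threads, out_tokens, seconds in runs:
    print(f"{threads:>2} threads: {out_tokens / seconds:.2f} tokens/s")
# -> 1.16, 2.14, 2.37, matching the table
```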

@alph4b3th
Author

I installed it outside of Docker and did not get results different from those already mentioned above. What could it be? I've seen some people running Alpaca 7B where it loads and responds in seconds, while on my machine, which is a powerful server, even 7B is extremely slow despite using 6 cores (I switched from Intel Xeon to AMD EPYC, which reduced the response time to 8 minutes and the load time to 18 minutes).

@voarsh2

voarsh2 commented Mar 31, 2023

Dalai loads faster than Serge. :D
I can confirm that.

@alph4b3th
Author

@voarsh2 could you share your evidence? I tested it and the results appear to be the opposite (Serge was faster).

@HRezaei

HRezaei commented Apr 1, 2023

Some performance improvements were merged into llama.cpp two days ago here. Are they automatically reflected in this project?

@titolindj

I think Dalai doesn't like Xeons; I have the exact same problem...
A decent build with more than enough RAM and an 8-core Xeon, nothing fancy but plenty more powerful than the average laptop or desktop, and the output still takes forever, even for the simplest questions.

@alph4b3th
Author

It's not specific to Intel Xeon. I moved the server to a VPS with an AMD EPYC (6 cores, 16 GB RAM) and it sped up very little. It made it possible to run the model, but it's still unusable, extremely slow. From what I could analyze, the slowness comes from the llama.cpp repository: at some point a release meant to optimize the algorithm actually introduced intermittent performance problems. On some Windows machines the model does very well, responding almost instantly compared to my Linux servers... It's telling that there is an intense discussion in their issues trying to figure out why the update that was supposed to boost model performance failed (and it clearly did).

Please don't blame the dalai developers. Dalai uses the llama.cpp library under the covers, as does the Serge chat, which shows the same slowness problem. They all use llama.cpp, and the real problem is in some code in that library. Raise it with the llama.cpp developers or write your own library.
