
Is it normal to have 30-minute/token slowness on an Intel Xeon? #250

Open
alph4b3th opened this issue Mar 24, 2023 · 20 comments

Comments

@alph4b3th

(screenshot attached)
I'm running Alpaca on my Intel Xeon Silver 4214 (4) @ 2.194 GHz and generation is extremely slow: it takes between 20 and 40 minutes per word. Is this normal? I've seen many people running it on even lower-end setups with faster results, so I wonder if something is wrong with my server.

@michaelwdombek

Nope, not really. I'm running on an older CPU without issues, but you might be short on RAM, assuming your screenshot was taken without a model running.

@alph4b3th
Author

You're right, I noticed strange behavior while running the model. Swap (usually 12 GB) is at about 90% usage on top of the 4 GB of physical RAM, for 16 GB in total. I think I need 16 GB of actual RAM, instead of 4 GB RAM + 12 GB swap, to speed up performance.
Out of curiosity, how much RAM does your computer have?
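
For reference, a minimal sketch along these lines (assuming the third-party psutil package, which is not something dalai ships) can show whether a model will fit in physical RAM or spill into swap:

```python
# Minimal sketch: check whether physical RAM alone can hold a model,
# or whether it will page out to swap (which is what makes generation crawl).
# Assumes the third-party psutil package is installed.
import psutil

GIB = 1024 ** 3

ram = psutil.virtual_memory()
swap = psutil.swap_memory()

print(f"RAM : {ram.available / GIB:.1f} GiB free of {ram.total / GIB:.1f} GiB")
print(f"Swap: {swap.used / GIB:.1f} GiB used of {swap.total / GIB:.1f} GiB")

model_ram_gib = 4 + 1  # ~4 GiB for 7B (per the docs) plus ~1 GiB headroom
if ram.available / GIB < model_ram_gib:
    print("Model will not fit in physical RAM and will page to swap -> very slow.")
```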

@alph4b3th
Author

Hi, I have a VPS (setup attached) and I'm seeing slowness when running the 7B and 13B models. Besides taking about 3-8 minutes to load, the model generates at roughly 3200 ms/token. This looks like abnormal behavior and so far I don't know what is causing it. Can anyone help me?
(screenshot of VPS setup attached)
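
Put differently, 3200 ms/token is well under one token per second; a quick conversion (illustrative only):

```python
# Convert the measured per-token latency into tokens per second.
ms_per_token = 3200
tokens_per_second = 1000 / ms_per_token
print(f"{tokens_per_second:.2f} tokens/s")  # ~0.31 tokens/s
```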

@dgasparri

I have the same issue. I'm running Windows 10 on an Intel i7 @ 2.80 GHz with 32 GB of RAM and an Nvidia GTX 1050.

The 7B model is painfully slow to run: it uses less than 40% of the CPU and 4 GB of RAM (I have more than 20 GB free), and it doesn't use the GPU.

@RicRicci22

Same here. I have an 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40 GHz.
I entered the prompt and clicked Go, but after 15 minutes of nothing I decided to kill the process. I'm not sure what caused the bottleneck, but the utilization statistics were not going up; it seems the model was not even loaded.

I will update if I manage to understand the issue.

@michaelwdombek

Sorry for the late feedback; I'm running on 64 GB and 128 GB machines.

The only thing that gave my setup a speed boost was increasing the thread count from 4 to 18.

@dgasparri

Memory is pretty straightforward (from the docs):

7B => ~4 GB
13B => ~8 GB
30B => ~16 GB
65B => ~32 GB

Those are optimistic estimates; add about 1 GB to each.

@michaelwdombek how many cores/threads does your CPU support? Did you see a performance increase when setting the thread count above your CPU's thread count?
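
To make that rule of thumb concrete, here is a small illustrative sketch (the numbers come from the docs quoted above; the function name is made up for the example):

```python
# Approximate RAM needed per model size, per the docs quoted above,
# plus ~1 GiB of headroom as suggested.
MODEL_RAM_GIB = {"7B": 4, "13B": 8, "30B": 16, "65B": 32}

def required_ram_gib(model: str, headroom_gib: float = 1.0) -> float:
    """Return a rough amount of free RAM (GiB) to plan for."""
    return MODEL_RAM_GIB[model] + headroom_gib

for model in MODEL_RAM_GIB:
    print(f"{model}: plan for about {required_ram_gib(model):.0f} GiB of free RAM")
```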

@michaelwdombek

I had never actually used more threads than I have, so I just tried it. It turns out it gets really, really slow if I oversubscribe, so I would recommend against that.
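
As a minimal sketch of that "don't oversubscribe" rule (standard library only; os.cpu_count() reports logical CPUs, and on SMT machines the physical core count may be an even better ceiling):

```python
import os

def pick_thread_count(requested: int) -> int:
    # Clamp the requested thread count to the logical CPUs on this machine;
    # going past that was observed above to make generation much slower.
    logical_cpus = os.cpu_count() or 1
    return max(1, min(requested, logical_cpus))

print(pick_thread_count(18))  # never more threads than CPUs
```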

@alph4b3th
Author

I saw someone on another forum saying there was slowness with Docker. Are you running with or without Docker?

@michaelwdombek

I'm running the docker compose setup provided here, so no, it does not look like a Docker problem.

I'll set up a test server on my oldest Xeon to check the speed and post an update later today.

@alph4b3th
Author

Looking forward to it. I'm also installing another application on top of llama.cpp to test performance outside of Docker; I intend to report back today as well.

@michaelwdombek

OK, as promised, I set dalai up using Docker on my server:

  • 1x Intel(R) Xeon(R) CPU E5-4620 v4 @ 2.10GHz
  • 64 GB RAM
  • Ubuntu 22.04 with kernel 5.15.0-69
  • nothing tuned, just a plain Ubuntu and Docker install

I checked the times and get roughly 2-3 tokens per second of output with LLaMA 13B, but this includes loading the model into RAM and such.

@dgasparri

dgasparri commented Mar 29, 2023 via email

@michaelwdombek

Hey, so I did some more experimenting @dgasparri :)

| threads | input tokens | output tokens | output chars | time (incl. model loading) | tokens/s |
|---------|--------------|---------------|--------------|----------------------------|----------|
| 4       | 7            | 70            | 360          | 60.1 s                     | 1.16     |
| 8       | 7            | 128           | 708          | 59.8 s                     | 2.14     |
| 16      | 7            | 142           | 684          | 59.9 s                     | 2.37     |

Input and output tokens were counted with OpenAI's Tokenizer, and the times include loading the model.
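
For clarity, the tokens/s column is simply output tokens divided by the total wall time (model loading included); a quick check:

```python
# Recompute the tokens/s column from the table above:
# output tokens divided by total wall time, model loading included.
runs = [(4, 70, 60.1), (8, 128, 59.8), (16, 142, 59.9)]  # (threads, output tokens, seconds)
for threads, out_tokens, seconds in runs:
    print(f"{threads:>2} threads: {out_tokens / seconds:.2f} tokens/s")
# -> 1.16, 2.14, 2.37, matching the table
```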

@alph4b3th
Author

I installed it outside of Docker and did not get results different from those already mentioned above. What could it be? I've seen some people running Alpaca 7B where it loads and responds in seconds, while on my machine, which is a powerful server, even 7B is extremely slow despite using 6 cores (I switched from Intel Xeon to AMD EPYC, which reduced the response time to 8 minutes and the load time to 18 minutes).

@voarsh2

voarsh2 commented Mar 31, 2023

Dalai loads faster than Serge. :D
I can confirm that.

@alph4b3th
Author

@voarsh2 could you share your evidence? I tested it and the results appear to be the opposite (Serge was faster).

@HRezaei

HRezaei commented Apr 1, 2023

Some performance improvements were merged into llama.cpp two days ago here. Are they automatically reflected in this project?

@titolindj

I think Dalai doesn't like Xeons; I have the exact same problem...
A decent build with more than enough RAM and an 8-core Xeon, nothing fancy but plenty more powerful than the average laptop or desktop, and the output still takes forever, even for the simplest questions.

@alph4b3th
Author

It's not specific to Intel Xeon. I moved the server to a VPS with an AMD EPYC (6 cores, 16 GB RAM) and it sped up very little. It made it possible to run the model, but it's still unusable, extremely slow. From what I could analyze, the slowness comes from the llama.cpp repository: at some point a release meant to optimize the algorithm actually introduced intermittent performance problems. On some Windows machines the model does very well, responding almost instantly compared to my Linux servers... It's telling that there is an intense discussion in their issues trying to figure out why the update that was supposed to boost model performance failed (and it clearly did).

Please don't blame the dalai developers. Dalai uses the llama.cpp library under the covers, as does the Serge chat, which shows the same slowness problem. They all use llama.cpp, and the real problem is in some code in that library. Raise it with the llama.cpp developers or write your own library.
