Incredibly slow response time #49

Closed · MartinPJB opened this issue Apr 8, 2023 · 35 comments

@MartinPJB

Hello.
I am still new to llama-cpp and I was wondering whether it is normal for it to take an incredibly long time to respond to my prompt.

FYI, I am assuming it runs on my CPU; here are my specs:

  • I have 16.0 GB of RAM
  • I am using an AMD Ryzen 7 1700X Eight-Core Processor rated at 3.40 GHz
  • Just in case, my GPU is an NVIDIA GeForce RTX 2070 SUPER.

Everything else seems to work fine; the model loads correctly (or at least it seems to).
I did a first test using the code showcased in the README.md:

from llama_cpp import Llama
llm = Llama(model_path="models/7B/...")
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output)

which returned this:

The output is what I expected (even though Uranus, Neptune and Pluto were missing), but the total time is extremely long (1124707.08 ms, about 18 minutes).

I wrote a second script to try to narrow down what could be causing the insanely long response time, but I don't know what's going on.

from llama_cpp import Llama
import time
print("Model loading")
llm = Llama(model_path="./model/ggml-model-q4_0_new.bin")

while True:
    prompt = input("Prompt> ")
    start_time = time.time()

    prompt = f"Q: {prompt} A: "
    print("Your prompt:", prompt, "Start time:", start_time)

    output = llm(prompt, max_tokens=1, stop=["Q:", "\n"], echo=True)
    print("Output:", output)
    print("End time:", time.time())
    print("--- Prompt reply duration: %s seconds ---" % (time.time() - start_time))

I may have done something wrong since I am still new to all of this, but do any of you have an idea how I could speed up the process? I searched for solutions on Google, GitHub and different forums, but nothing seems to work.

PS: For those interested in the CLI output when it loads the model:

llama_model_load: loading model from './model/ggml-model-q4_0_new.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './model/ggml-model-q4_0_new.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

I apologize in advance if my English doesn't make sense sometimes; it is not my native language.
Thanks in advance for the help, regards. 👋

@abetlen (Owner)

abetlen commented Apr 8, 2023

Hi @MartinPJB, it looks like the package was built with the correct optimizations. Could you pass verbose=True when instantiating the Llama class? This should give you per-token timing information.

Also, if possible, can you try building the regular llama.cpp project and trying out those examples, just to confirm that this issue is localized to the Python package.
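
For concreteness, a minimal sketch of that suggestion (the model path is only an example; the tokens-per-second figure is derived from the usage field of the returned completion, so it is a rough wall-clock estimate rather than the per-token timings that verbose=True prints):

from llama_cpp import Llama
import time

# verbose=True makes the underlying llama.cpp timings print to the console.
llm = Llama(model_path="./model/ggml-model-q4_0_new.bin", verbose=True)

start = time.time()
output = llm("Q: Name the planets in the solar system? A: ",
             max_tokens=32, stop=["Q:", "\n"], echo=True)
elapsed = time.time() - start

# The completion dict reports token counts under "usage".
generated = output["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f} s ({generated / elapsed:.2f} tokens/s)")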

@MartinPJB (Author)

> Hi @MartinPJB, it looks like the package was built with the correct optimizations. Could you pass verbose=True when instantiating the Llama class? This should give you per-token timing information.
>
> Also, if possible, can you try building the regular llama.cpp project and trying out those examples, just to confirm that this issue is localized to the Python package.

Hi @abetlen, I did pass verbose=True when instantiating the Llama class:

llm = Llama(model_path="./model/ggml-model-q4_0_new.bin", verbose=True)

I assume that is what you meant. However, it does not print any details in the console after I submit my prompt.

Since you said the package was built with the correct optimizations, maybe the problem is coming from my computer itself; however, I have no idea why it is so slow.

By the way, I am not familiar with building the regular llama.cpp project. Maybe you could provide some guidance or point me to related issues.

@abetlen (Owner)

abetlen commented Apr 8, 2023

If you follow these instructions you should be able to test it out: https://github.com/ggerganov/llama.cpp#usage

@MartinPJB (Author)

MartinPJB commented Apr 8, 2023

No worries. I am going to do what's needed and share the results with you in a bit / tomorrow, if that works for you.

@keldenl

keldenl commented Apr 8, 2023

For what it's worth, I am also getting much slower generations compared to interacting with llama.cpp natively in the terminal.

@keldenl

keldenl commented Apr 9, 2023

@abetlen I did some testing on my M1 Mac, and here are my results (llama.cpp is faster)

llama-cpp-python (stats for calling /v1/chat/completions):

llama_print_timings:        load time =  3102.15 ms
llama_print_timings:      sample time =   189.52 ms /    57 runs   (    3.32 ms per run)
llama_print_timings: prompt eval time = 10289.07 ms /    61 tokens (  168.67 ms per token)
llama_print_timings:        eval time =  8149.10 ms /    56 runs   (  145.52 ms per run)
llama_print_timings:       total time = 18632.98 ms

llama.cpp:

llama_print_timings:        load time = 19000.44 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 18459.57 ms /   512 tokens (   36.05 ms per token)
llama_print_timings:        eval time =  3828.14 ms /     1 runs   ( 3828.14 ms per run)
llama_print_timings:       total time = 22828.60 ms
llama_print_timings:        load time = 22379.05 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 21632.51 ms /   512 tokens (   42.25 ms per token)
llama_print_timings:        eval time =  2172.59 ms /     1 runs   ( 2172.59 ms per run)
llama_print_timings:       total time = 24551.66 ms

Here are the model loads (they look the same to me):

llama-cpp-python

llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 2048
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 7759.84 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 9807.93 MB (+ 1608.00 MB per state)

llama.cpp

llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 2048
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 7759.84 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 9807.93 MB (+ 1608.00 MB per state)

Thoughts? Am I doing this right? I just added verbose=True, and for llama.cpp I ran it with --mtest.

@keldenl

keldenl commented Apr 9, 2023

What's the difference between prompt eval time and eval time? It seems like llama.cpp is higher on eval time for some reason 🤷

@keldenl

keldenl commented Apr 9, 2023

Ran a couple more runs of llama.cpp with --mlock (using the 13B model) and got somewhat better (?) results:

llama_print_timings:        load time = 15164.63 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 14309.46 ms /   512 tokens (   27.95 ms per token)
llama_print_timings:        eval time =   821.64 ms /     1 runs   (  821.64 ms per run)
llama_print_timings:       total time = 15986.28 ms
llama_print_timings:        load time = 15460.01 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 14507.19 ms /   512 tokens (   28.33 ms per token)
llama_print_timings:        eval time =   900.15 ms /     1 runs   (  900.15 ms per run)
llama_print_timings:       total time = 16360.18 ms

@MartinPJB (Author)

For some reason, it seems like no matter what model I use, the response time stays around the same (between 18 and 25 minutes). I did a series of tests tonight and all of them returned around the same execution time.

I did some tests using llama.cpp as well, and it doesn't seem to change anything; llama.cpp also takes a long time to respond on my computer.

Do any of you have an idea of what I am doing wrong? Thanks.

@MartinPJB (Author)

Update:
llama.cpp took almost 2.5 hours to respond to my prompt, while the Python binding took 18 minutes. I'm starting to think the problem is not with the library but rather with my computer itself. If any of you have an idea as to why it is so slow on my PC, I'd love to hear it.

@acheong08

With llama.cpp:

llama_print_timings:        load time =  3927.93 ms
llama_print_timings:      sample time =     5.51 ms /    10 runs   (    0.55 ms per run)
llama_print_timings: prompt eval time = 21902.09 ms /    48 tokens (  456.29 ms per token)
llama_print_timings:        eval time =  4735.28 ms /     9 runs   (  526.14 ms per run)
llama_print_timings:       total time = 27363.51 ms

With llama-cpp-python:

Output generated in 121.29 seconds (0.06 tokens/s, 7 tokens, context 44)

It is slower by orders of magnitude.

@MartinPJB (Author)

MartinPJB commented Apr 9, 2023

> With llama.cpp:
>
> llama_print_timings:        load time =  3927.93 ms
> llama_print_timings:      sample time =     5.51 ms /    10 runs   (    0.55 ms per run)
> llama_print_timings: prompt eval time = 21902.09 ms /    48 tokens (  456.29 ms per token)
> llama_print_timings:        eval time =  4735.28 ms /     9 runs   (  526.14 ms per run)
> llama_print_timings:       total time = 27363.51 ms
>
> With llama-cpp-python:
>
> Output generated in 121.29 seconds (0.06 tokens/s, 7 tokens, context 44)
>
> It is slower by orders of magnitude.

@acheong08 I see, although it shouldn't take around 18 minutes to generate an output, should it?

@acheong08

It shouldn't. Make sure you have the latest version of llama.cpp installed and rebuild with make -j; they might have added some optimizations. Also make sure you have enough memory - otherwise it falls back to swap, which is very slow.

@MartinPJB
(Author)

I see, I'll try that and I'll keep you updated. By the way, is there any way to check how much memory is used by the binding during its execution?

@acheong08

Use the system monitor / task manager and look for Python processes.
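
If you would rather check from inside the script, a minimal sketch using psutil (an extra dependency, not something the binding itself provides) could look like this:

import os
import psutil  # pip install psutil

process = psutil.Process(os.getpid())

def report_memory(label):
    # Resident set size of this Python process, in GB.
    rss_gb = process.memory_info().rss / 1024 ** 3
    print(f"[{label}] resident memory: {rss_gb:.2f} GB")

report_memory("before model load")
# ... create the Llama object and run a completion here ...
report_memory("after completion")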

@acheong08

I'm still not sure why this Python binding is so much slower than standard llama.cpp, though.

@MartinPJB (Author)

MartinPJB commented Apr 9, 2023

@acheong08
I'm not sure either. After some checks, it seems like my Python script using the binding goes up to around (more or less) 7.65 GB of memory used on average. Just in case, I updated the library as well, since I noticed the owner of the repo did some editing. However, it still takes a lot of time to generate an answer to my prompt.

I have one running right now - I launched it around 5 minutes ago - and I'll keep you updated on when it ends and how long it took.

@MartinPJB (Author)

> @acheong08 I'm not sure either. After some checks, it seems like my Python script using the binding goes up to around (more or less) 7.65 GB of memory used on average. Just in case, I updated the library as well, since I noticed the owner of the repo did some editing. However, it still takes a lot of time to generate an answer to my prompt.
>
> I have one running right now - I launched it around 5 minutes ago - and I'll keep you updated on when it ends and how long it took.

Took 74 minutes this time. I really wonder what could be wrong.

@louis030195

Does llama.cpp even support GPU? I don't see any function or anything to move things to different hardware, the way you usually would in PyTorch etc.

@acheong08

> Does llama.cpp even support GPU? I don't see any function or anything to move things to different hardware, the way you usually would in PyTorch etc.

no

@bratislav

bratislav commented Apr 10, 2023

Hi all,

I can also confirm there's a performance issue with this, but it looks to me like the problem is only in the high_level (FastAPI) variant.

The regular llama.cpp works fine on our server - it responds to simple instructions in a matter of seconds (it's a small SOHO server with a 4-core i5 and 16 GB of memory). I also tried the low_level variant of this Python binding and - interestingly enough - it performs almost the same as llama.cpp (starts generating words in around 5-6 seconds)!

When I finally tried the web server version (high_level variant) through the /docs API client, I got much worse performance - it usually takes around 25 seconds to start generating anything :( I started tinkering with the parameters and increased the number of threads from 2 to 4, which gave somewhat better results (the wait time is now around 16 seconds, more or less), but that is still about 2x slower than the llama.cpp variant. I also tried to enable mmap (had to increase the limit using the ulimit command), but no luck - still the same (except that memory is now almost full and swap starts eating space). I even tried playing with other parameters, but it seems nothing can speed up this server - which is so weird, since it's supposed to be a web service extension over a Python C++ wrapper (umm, right?).

I didn't have time to dive deeper into the code, but my naive guess is that something fishy in the high_level code makes generation slower than the low_level variant - no idea what, though.

Let me know if you need any other info; I'm very interested in getting this to work since we use LLMs as microservices extensively.

Cheers
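
Since the thread count made a measurable difference above, here is a minimal sketch of pinning it explicitly through the Python API. This assumes the n_threads and use_mlock keyword arguments of the Llama class; the defaults, and the way the bundled server exposes these settings, may differ between versions:

from llama_cpp import Llama

# Pin the number of CPU threads instead of relying on the default, and ask the OS
# to keep the model resident in RAM (needs a sufficient memlock ulimit).
llm = Llama(
    model_path="./models/7B/ggml-llama-7b-q4.bin",
    n_threads=4,
    use_mlock=True,
    verbose=True,
)

output = llm("Q: What is the capital of France? A: ", max_tokens=16, stop=["Q:", "\n"])
print(output["choices"][0]["text"])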

@abetlen (Owner)

abetlen commented Apr 10, 2023

@bratislav I'll definitely look into this - can you provide some minimal examples? Also, there are some high-level API examples; do you mind confirming whether the issue is there or in the server? There may be some weird asyncio bugs in the server, so I'm just trying to figure out where this could be coming from.

I also provided a small example showing how you can generate a flamegraph to profile where the majority of the time is being spent; that might help us track down whether it's a Python issue or comes from different parameters / settings in llama.cpp: #51 (comment)

@abetlen abetlen pinned this issue Apr 10, 2023
@abetlen (Owner)

abetlen commented Apr 10, 2023

Here are my runs for the high-level vs. low-level API on my machine, just for reference. I've highlighted the portions that are running in C++ in brown and provided the system information and per-token timings.

Low Level API
(flamegraph and per-token timing screenshots)

High Level API
(flamegraph and per-token timing screenshots)

@MillionthOdin16 (Contributor)

> Update:
> llama.cpp took almost 2.5 hours to respond to my prompt, while the Python binding took 18 minutes. I'm starting to think the problem is not with the library but rather with my computer itself. If any of you have an idea as to why it is so slow on my PC, I'd love to hear it.

Look at your memory and disk utilization. If your RAM is anywhere near 100%, you're going to have bad performance if the system is using swap to compensate for the lack of memory. If your disk utilization goes up as you run the executable, that's a sign that bad things are happening.

If it takes minutes to compute a response then something is definitely wrong with how things are running.

@abetlen abetlen unpinned this issue Apr 11, 2023
@bratislav

bratislav commented Apr 11, 2023

Hi @abetlen, thanks for the quick response.

Here's our SOHO server info:

Ubuntu 20.04.5 LTS
i5
16GB RAM
Python: 3.8.10
gcc: 9.4.0
llama-cpp-python: 0.1.30

I downloaded the alpaca and llama models from the torrent found here, using the magnet link at the top of the page... not sure if the models are the right ones, but here are the files I'm using:

llama-7b-ggml-q4_0/ggml-model-q4_0.bin
alpaca-7b-ggml-q4_0-lora-merged/ggml-model-q4_0.bin

(haven't converted or modified them, I'm just running them)

Okay, so the first run is the regular (latest) llama.cpp, downloaded and built yesterday:

baki@srv-mini:~/llama.cpp$ ./examples/alpaca.sh 
main: seed = 1681203711
llama_model_load: loading model from './models/7B/ggml-alpaca-7b-q4.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/7B/ggml-alpaca-7b-q4.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 3 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: temp = 0.200000, top_k = 10000, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 512, n_batch = 256, n_predict = 128, n_keep = 21


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.

 Below is an instruction that describes a task. Write a response that appropriately completes the request.
> Hi, from now on your name is Bob, a virtual assistant made by Baki.
Hello, I'm Bob.

### Instruction:


> Who made you?
I was made by Baki.

### Instruction:


>

Timings:

Model loading (until first input shows): ~ 6 seconds
After first instruction, response shows after: ~7 seconds
After second instruction, response shows after: ~4 seconds

Second run: I try the low_level Python wrapper around the same llama.cpp version (downloaded into the /vendor dir), on the same machine:

baki@srv-mini:~/llama-cpp-python$ python3 examples/low_level_api/low_level_api_chat_cpp.py --model ~/llama.cpp/models/7B/ggml-llama-7b-q4.bin -r "User:" --interactive
seed = 1681204762
llama.cpp: loading model from /home/baki/llama.cpp/models/7B/ggml-llama-7b-q4.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 2
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 2052.00 MB per state)
llama_init_from_file: kv self size  =  512.00 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
interactive mode on.
Reverse prompt: 'User:'
sampling: temp = 0.8,top_k = 40,top_p = 0.95,repeat_last_n = 64,repeat_penalty = 1.1

generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.


 2014-15 Women's Lacrosse News
Virginia Tech Announces Schedule for its Second Season of Lax
BLACKSBURG, Va. – Virginia Tech women’s lacrosse head coach Shannon Smith announced her team’s 2015 schedule on Thursday morning (December 4). The Hokies open their season at home against Longwood (Jan. 29) before heading to North Carolina for a Feb. 1 matchup with Duke.
Six Women's Lacrosse Players Named All

Not sure if I messed up something with the params, but this example in interactive mode doesn't wait for any input; it just starts spitting out words - and stops after some number of them. Also, if I switch from the chat example to the llama one (low_level_api_llama_cpp.py), it just segfaults:

baki@srv-mini:~/llama-cpp-python$ python3 examples/low_level_api/low_level_api_llama_cpp.py --model ~/llama.cpp/models/7B/ggml-llama-7b-q4.bin
error loading model: failed to open ../models/7B/ggml-model.bin: No such file or directory
llama_init_from_file: failed to load model
Segmentation fault (core dumped)

Timings of the chat example:

Model loading (until first word shows): almost immediately
Number of words per second: around 3 words/sec (Awesome!)

Third run: I try the high_level_api example, as you requested:

baki@srv-mini:~/llama-cpp-python$ python3 examples/high_level_api/high_level_api_inference.py --model ~/llama.cpp/models/7B/ggml-llama-7b-q4.bin
llama.cpp: loading model from /home/baki/llama.cpp/models/7B/ggml-llama-7b-q4.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 2
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 2052.00 MB per state)
llama_init_from_file: kv self size  =  512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:        load time =  2471.73 ms
llama_print_timings:      sample time =    42.13 ms /    48 runs   (    0.88 ms per run)
llama_print_timings: prompt eval time =  5722.36 ms /    19 tokens (  301.18 ms per token)
llama_print_timings:        eval time = 17361.61 ms /    47 runs   (  369.40 ms per run)
llama_print_timings:       total time = 23131.09 ms
{
  "id": "cmpl-c6a253f3-102e-4eec-b888-a9442690e47f",
  "object": "text_completion",
  "created": 1681205386,
  "model": "/home/baki/llama.cpp/models/7B/ggml-llama-7b-q4.bin",
  "choices": [
    {
      "text": "Question: What are the names of the planets in the solar system? Answer: 1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune 9. Pluto 10.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 48,
    "total_tokens": 67
  }
}

Timings:
This took around 25 seconds! (Subsequent runs don't speed things up; I always get a similar time.)

In a final run, I try llama-cpp-python from pip, firing instructions from the /docs web client (the /v1/chat/completions endpoint with the default JSON request). The first instruction's response shows up after around the same time (25 seconds), and if I bump the threads up from 2 to 4 I get the response in around 12-16 seconds. Almost everything I've tried so far to speed this up further has failed. Also, when I build the Python wheel from source, the timings stay the same.

Here's an example:

baki@srv-mini:~/llama.cpp$ python3 -m llama_cpp.server
llama.cpp: loading model from ./models/7B/ggml-alpaca-7b-q4.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 2
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
INFO:     Started server process [315210]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     192.168.0.12:50344 - "GET / HTTP/1.1" 404 Not Found
INFO:     192.168.0.12:50344 - "GET /docs HTTP/1.1" 200 OK
INFO:     192.168.0.12:50344 - "GET /openapi.json HTTP/1.1" 200 OK

llama_print_timings:        load time =  1376.85 ms
llama_print_timings:      sample time =     9.75 ms /    11 runs   (    0.89 ms per run)
llama_print_timings: prompt eval time = 10696.38 ms /    70 tokens (  152.81 ms per token)
llama_print_timings:        eval time =  2017.02 ms /    10 runs   (  201.70 ms per run)
llama_print_timings:       total time = 12724.85 ms
INFO:     192.168.0.12:50345 - "POST /v1/chat/completions HTTP/1.1" 200 OK

I can run all these examples on our production server, which is a 4-Xeon IBM beast with 128 GB of memory, to see if there's any difference.

TODO:

  • I haven't checked the SHA256 checksums of the model files yet (I hope the files are okay) - a quick way to do this is sketched after this list
  • I will run the py-spy thing later today
  • I wonder if I'm reading the latest timings right, because this web service returns the response only after ALL tokens are generated, not one by one (as in some websocket examples)
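
A quick sketch for the checksum item above, using plain hashlib (the path is just an example):

import hashlib

def sha256sum(path, chunk_size=1 << 20):
    # Stream the file in 1 MB chunks so large model files don't have to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256sum("models/7B/ggml-llama-7b-q4.bin"))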

@bratislav

bratislav commented Apr 11, 2023

@abetlen Here are the outputs of py-spy (using 0.1.30):

Low-level:

baki@srv-mini:~/llama-cpp-python$ py-spy record --native -o profile.svg -- python3 examples/low_level_api/low_level_api_llama_cpp.py -m ~/llama.cpp/models/7B/ggml-llama-7b-q4.bin 
py-spy> Sampling process 100 times a second. Press Control-C to exit.

llama.cpp: loading model from /home/baki/llama.cpp/models/7B/ggml-llama-7b-q4.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 2
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 2052.00 MB per state)
py-spy> 1.15s behind in sampling, results may be inaccurate. Try reducing the sampling rate
llama_init_from_file: kv self size  =  512.00 MB
 

### Instruction:
What is the capital of France?

### Response:
Paris.

### Questions:
Who was the first president of America?


llama_print_timings:        load time =  2942.98 ms
llama_print_timings:      sample time =    19.85 ms /    20 runs   (    0.99 ms per run)
llama_print_timings: prompt eval time =  6159.05 ms /    28 tokens (  219.97 ms per token)
llama_print_timings:        eval time =  4763.31 ms /    19 runs   (  250.70 ms per run)
llama_print_timings:       total time = 12137.12 ms

py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'profile.svg'. Samples: 1250 Errors: 0

(flamegraph: profile.svg)

Here it is for the high_level API (took almost 25 seconds!):

baki@srv-mini:~/llama-cpp-python$ py-spy record --native -o profile2.svg -- python3 examples/high_level_api/high_level_api_inference.py -m ~/llama.cpp/models/7B/ggml-llama-7b-q4.bin 
py-spy> Sampling process 100 times a second. Press Control-C to exit.

llama.cpp: loading model from /home/baki/llama.cpp/models/7B/ggml-llama-7b-q4.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 2
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 2052.00 MB per state)
llama_init_from_file: kv self size  =  512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:        load time =  2629.82 ms
llama_print_timings:      sample time =    45.66 ms /    48 runs   (    0.95 ms per run)
llama_print_timings: prompt eval time =  6087.97 ms /    19 tokens (  320.42 ms per token)
llama_print_timings:        eval time = 18470.45 ms /    47 runs   (  392.99 ms per run)
llama_print_timings:       total time = 24609.50 ms
{
  "id": "cmpl-94bf3241-852a-469e-8710-fd4702e280d4",
  "object": "text_completion",
  "created": 1681212374,
  "model": "/home/baki/llama.cpp/models/7B/ggml-llama-7b-q4.bin",
  "choices": [
    {
      "text": "Question: What are the names of the planets in the solar system? Answer: 1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune 9. Pluto 10.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 48,
    "total_tokens": 67
  }
}

(flamegraph: profile2.svg)

@acheong08

Latest version fixed my issues

(flamegraphs: profile.svg, profile2.svg)

@acheong08

The difference seems to be minimal now.

@MartinPJB (Author)

It seems that way to me as well. It now takes just a few seconds instead of the 18 minutes I had before. I don't know what happened, but I am glad it got fixed. Thanks!

@abetlen (Owner)

abetlen commented Apr 11, 2023

It might have just been a regression in the base llama.cpp library, but who knows.

@abetlen abetlen closed this as completed Apr 11, 2023
@bratislav

bratislav commented Apr 12, 2023

@abetlen @MartinPJB @acheong08 Guys, I've updated both llama.cpp and llama-cpp-python and I'm still getting the same timings as before - no change whatsoever :( I'm starting to think my model is wrong. Are you willing to share the link you downloaded the .bin files from (or at least the SHA256 sums so I can compare with mine)?

Did you create the *.bin files from the weights yourselves, or did you download already-made GGML files? Are you using GGML v2, maybe?
I've tried another alpaca bin file I found on the web, and inference in the high_level example is even slower - 44 seconds :(

Thanks

@acheong08

@bratislav The issue was not with the bin files. I tested with the same model throughout. Specifically https://huggingface.co/Pi3141/gpt4-x-alpaca-native-13B-ggml

@acheong08

make sure to cd vendor && rm -rf llama.cpp && git clone https://github.com/ggerganov/llama.cpp/ && cd .. && pip3 install .

@acheong08

This fetches the latest version.

@0xdevalias

0xdevalias commented Apr 14, 2023

> Latest version fixed my issues

> The difference seems to be minimal now.

> It seems that way to me as well. It now takes just a few seconds instead of the 18 minutes I had before. I don't know what happened, but I am glad it got fixed. Thanks!

> It might have just been a regression in the base llama.cpp library, but who knows.

@acheong08 @abetlen @MartinPJB Unsure if the slowness here was related, but there was a bunch of performance regression stuff in the base lib that I was vaguely following that was resolved recently:


> Thank you for this hard work - I missed this regression because I rarely run generations with more than a few tens of tokens. The problem is that the transpose operation for the V matrix is very slow and becomes slower and slower with every new token added.
>
> I think I have provided a fix here: ggerganov/llama.cpp#775
>
> Tested only on M1 so far
>
> Originally posted by @ggerganov in ggerganov/llama.cpp#603 (comment)

> https://github.com/ggerganov/llama.cpp/issues/#775 was merged pretty quickly, cyyynthia if it's not too much trouble, could you update your pretty graphs with that?
>
> @sw The graphs aren't super pretty this time because I didn't take the time to properly close everything and had a bunch of things open in background while the test was running 😅
>
> That being said, the regression appears to be gone. 🎉 Here's the graphs and the raw CSV:
> (graph image and token_times.csv attached)
>
> Originally posted by @cyyynthia in ggerganov/llama.cpp#603 (comment)

> Well I guess that settles it. The Great Wizard Georgi has saved the day! Thanks to @cyyynthia and @KASR for putting in the hard work of tracking this down.
>
> I have opened https://github.com/ggerganov/llama.cpp/issues/#790 to track the discrepancy in the different partial times vs total time. I think this issue could be closed. Thanks everyone.
>
> Originally posted by @sw in ggerganov/llama.cpp#603 (comment)
