Incredibly slow response time #49

Closed · MartinPJB opened this issue Apr 8, 2023 · 35 comments

@MartinPJB

Hello.
I am still new to llama-cpp and I was wondering whether it is normal for it to take an incredibly long time to respond to my prompt.

FYI, I am assuming it runs on my CPU; here are my specs:

  • I have 16.0 GB of RAM
  • I am using an AMD Ryzen 7 1700X Eight-Core Processor rated at 3.40 GHz
  • Just in case, my GPU is an NVIDIA GeForce RTX 2070 SUPER.

Everything else seems to work fine; the model loads correctly (or at least it seems to).
I did a first test using the code showcased in the README.md:

from llama_cpp import Llama
llm = Llama(model_path="models/7B/...")
output = llm("Q: Name the planets in the solar system? A: ", max_tokens=32, stop=["Q:", "\n"], echo=True)
print(output)

which returned this:

The output is what I expected (even though Uranus, Neptune and Pluto were missing), but the total time is extremely long (1124707.08 ms, about 18 minutes).

I wrote a second script to try to narrow down what could be causing the insanely long response time, but I don't know what's going on.

from llama_cpp import Llama
import time
print("Model loading")
llm = Llama(model_path="./model/ggml-model-q4_0_new.bin")

while True:
    prompt = input("Prompt> ")
    start_time = time.time()

    prompt = f"Q: {prompt} A: "
    print("Your prompt:", prompt, "Start time:", start_time)

    output = llm(prompt, max_tokens=1, stop=["Q:", "\n"], echo=True)
    print("Output:", output)
    print("End time:", time.time())
    print("--- Prompt reply duration: %s seconds ---" % (time.time() - start_time))

I may have done something wrong since I am still new to all of this, but do any of you have an idea how I could speed up the process? I searched for solutions on Google, GitHub and different forums, but nothing seems to work.

PS: For those interested in the CLI output when it loads the model:

llama_model_load: loading model from './model/ggml-model-q4_0_new.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 2052.00 MB per state)
llama_model_load: loading tensors from './model/ggml-model-q4_0_new.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

I apologize in advance if my English doesn't make sense sometimes; it is not my native language.
Thanks in advance for the help, regards. 👋

@abetlen (Owner)

abetlen commented Apr 8, 2023

Hi @MartinPJB, it looks like the package was built with the correct optimizations. Could you pass verbose=True when instantiating the Llama class? This should give you per-token timing information.

Also, if possible, can you try building the regular llama.cpp project and trying out those examples, just to confirm that this issue is localized to the Python package.
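
For concreteness, a minimal sketch of that suggestion (the model path is only an example; the tokens-per-second figure is derived from the usage field of the returned completion, so it is a rough wall-clock estimate rather than the per-token timings that verbose=True prints):

from llama_cpp import Llama
import time

# verbose=True makes the underlying llama.cpp timings print to the console.
llm = Llama(model_path="./model/ggml-model-q4_0_new.bin", verbose=True)

start = time.time()
output = llm("Q: Name the planets in the solar system? A: ",
             max_tokens=32, stop=["Q:", "\n"], echo=True)
elapsed = time.time() - start

# The completion dict reports token counts under "usage".
generated = output["usage"]["completion_tokens"]
print(f"{generated} tokens in {elapsed:.1f} s ({generated / elapsed:.2f} tokens/s)")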

@MartinPJB (Author)

> Hi @MartinPJB, it looks like the package was built with the correct optimizations. Could you pass verbose=True when instantiating the Llama class? This should give you per-token timing information.
>
> Also, if possible, can you try building the regular llama.cpp project and trying out those examples, just to confirm that this issue is localized to the Python package.

Hi @abetlen, I did pass verbose=True when instantiating the Llama class:

llm = Llama(model_path="./model/ggml-model-q4_0_new.bin", verbose=True)

I assume that is what you meant. However, it does not print any details in the console after I submit my prompt.

Since you said the package was built with the correct optimizations, maybe the problem is coming from my computer itself; however, I have no idea why it is so slow.

By the way, I am not familiar with building the regular llama.cpp project. Maybe you could provide some guidance or point me to related issues.

@abetlen (Owner)

abetlen commented Apr 8, 2023

If you follow these instructions you should be able to test it out: https://github.com/ggerganov/llama.cpp#usage

@MartinPJB (Author)

MartinPJB commented Apr 8, 2023

No worries. I am going to do what's needed and share the results with you in a bit / tomorrow, if that works for you.

@keldenl

keldenl commented Apr 8, 2023

For what it's worth, I am also getting much slower generations compared to interacting with llama.cpp natively in the terminal.

@keldenl

keldenl commented Apr 9, 2023

@abetlen I did some testing on my M1 Mac, and here are my results (llama.cpp is faster)

llama-cpp-python (stats for calling /v1/chat/completions):

llama_print_timings:        load time =  3102.15 ms
llama_print_timings:      sample time =   189.52 ms /    57 runs   (    3.32 ms per run)
llama_print_timings: prompt eval time = 10289.07 ms /    61 tokens (  168.67 ms per token)
llama_print_timings:        eval time =  8149.10 ms /    56 runs   (  145.52 ms per run)
llama_print_timings:       total time = 18632.98 ms

llama.cpp:

llama_print_timings:        load time = 19000.44 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 18459.57 ms /   512 tokens (   36.05 ms per token)
llama_print_timings:        eval time =  3828.14 ms /     1 runs   ( 3828.14 ms per run)
llama_print_timings:       total time = 22828.60 ms
llama_print_timings:        load time = 22379.05 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 21632.51 ms /   512 tokens (   42.25 ms per token)
llama_print_timings:        eval time =  2172.59 ms /     1 runs   ( 2172.59 ms per run)
llama_print_timings:       total time = 24551.66 ms

Here are the model loads (they look the same to me):

llama-cpp-python

llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 2048
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 7759.84 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 9807.93 MB (+ 1608.00 MB per state)

llama.cpp

llama_model_load: n_vocab = 32001
llama_model_load: n_ctx   = 2048
llama_model_load: n_embd  = 5120
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 40
llama_model_load: n_layer = 40
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 13824
llama_model_load: n_parts = 2
llama_model_load: type    = 2
llama_model_load: ggml map size = 7759.84 MB
llama_model_load: ggml ctx size = 101.25 KB
llama_model_load: mem required  = 9807.93 MB (+ 1608.00 MB per state)

Thoughts? Am I doing this right? I just added verbose=True, and for llama.cpp I ran it with --mtest.

@keldenl

keldenl commented Apr 9, 2023

What's the difference between prompt eval time and eval time? It seems like llama.cpp is higher on eval time for some reason 🤷

@keldenl

keldenl commented Apr 9, 2023

Ran a couple more runs of llama.cpp with --mlock (using the 13B model) and got somewhat better (?) results:

llama_print_timings:        load time = 15164.63 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 14309.46 ms /   512 tokens (   27.95 ms per token)
llama_print_timings:        eval time =   821.64 ms /     1 runs   (  821.64 ms per run)
llama_print_timings:       total time = 15986.28 ms
llama_print_timings:        load time = 15460.01 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time = 14507.19 ms /   512 tokens (   28.33 ms per token)
llama_print_timings:        eval time =   900.15 ms /     1 runs   (  900.15 ms per run)
llama_print_timings:       total time = 16360.18 ms

@MartinPJB (Author)

For some reason, it seems like no matter what model I use, the response time stays around the same (between 18 and 25 minutes). I did a series of tests tonight and all of them returned around the same execution time.

I did some tests using llama.cpp as well, and it doesn't seem to change anything; llama.cpp also takes a long time to respond on my computer.

Do any of you have an idea of what I am doing wrong? Thanks.

@MartinPJB (Author)

Update:
llama.cpp took almost 2.5 hours to respond to my prompt, while the Python binding took 18 minutes. I'm starting to think the problem is not with the library but rather with my computer itself. If any of you have an idea as to why it is so slow on my PC, I'd love to hear it.

@acheong08

With llama.cpp:

llama_print_timings:        load time =  3927.93 ms
llama_print_timings:      sample time =     5.51 ms /    10 runs   (    0.55 ms per run)
llama_print_timings: prompt eval time = 21902.09 ms /    48 tokens (  456.29 ms per token)
llama_print_timings:        eval time =  4735.28 ms /     9 runs   (  526.14 ms per run)
llama_print_timings:       total time = 27363.51 ms

With llama-cpp-python:

Output generated in 121.29 seconds (0.06 tokens/s, 7 tokens, context 44)

It is slower by orders of magnitude.

@MartinPJB (Author)

MartinPJB commented Apr 9, 2023

> With llama.cpp:
>
> llama_print_timings:        load time =  3927.93 ms
> llama_print_timings:      sample time =     5.51 ms /    10 runs   (    0.55 ms per run)
> llama_print_timings: prompt eval time = 21902.09 ms /    48 tokens (  456.29 ms per token)
> llama_print_timings:        eval time =  4735.28 ms /     9 runs   (  526.14 ms per run)
> llama_print_timings:       total time = 27363.51 ms
>
> With llama-cpp-python:
>
> Output generated in 121.29 seconds (0.06 tokens/s, 7 tokens, context 44)
>
> It is slower by orders of magnitude.

@acheong08 I see, although it shouldn't take around 18 minutes to generate an output, should it?

@acheong08

It shouldn't. Make sure you have the latest version of llama.cpp installed and rebuild with make -j; they might have added some optimizations. Also make sure you have enough memory - otherwise it falls back to swap, which is very slow.

@MartinPJB
(Author)

I see, I'll try that and I'll keep you updated. By the way, is there any way to check how much memory is used by the binding during its execution?

@acheong08

Use the system monitor / task manager and look for Python processes.
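
If you would rather check from inside the script, a minimal sketch using psutil (an extra dependency, not something the binding itself provides) could look like this:

import os
import psutil  # pip install psutil

process = psutil.Process(os.getpid())

def report_memory(label):
    # Resident set size of this Python process, in GB.
    rss_gb = process.memory_info().rss / 1024 ** 3
    print(f"[{label}] resident memory: {rss_gb:.2f} GB")

report_memory("before model load")
# ... create the Llama object and run a completion here ...
report_memory("after completion")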

@acheong08

I'm still not sure why this Python binding is so much slower than standard llama.cpp, though.

@MartinPJB (Author)

MartinPJB commented Apr 9, 2023

@acheong08
I'm not sure either. After some checks, it seems like my Python script using the binding goes up to around (more or less) 7.65 GB of memory used on average. Just in case, I updated the library as well, since I noticed the owner of the repo did some editing. However, it still takes a lot of time to generate an answer to my prompt.

I have one running right now - I launched it around 5 minutes ago - and I'll keep you updated on when it ends and how long it took.

@MartinPJB (Author)

> @acheong08 I'm not sure either. After some checks, it seems like my Python script using the binding goes up to around (more or less) 7.65 GB of memory used on average. Just in case, I updated the library as well, since I noticed the owner of the repo did some editing. However, it still takes a lot of time to generate an answer to my prompt.
>
> I have one running right now - I launched it around 5 minutes ago - and I'll keep you updated on when it ends and how long it took.

Took 74 minutes this time. I really wonder what could be wrong.

@louis030195

Does llama.cpp even support GPU? I don't see any function or anything to move things to different hardware, the way you usually would in PyTorch etc.

@acheong08

> Does llama.cpp even support GPU? I don't see any function or anything to move things to different hardware, the way you usually would in PyTorch etc.

no

@bratislav

bratislav commented Apr 10, 2023

Hi all,

I can also confirm there's a performance issue with this, but it looks to me like the problem is only in the high_level (FastAPI) variant.

The regular llama.cpp works fine on our server - it responds to simple instructions in a matter of seconds (it's a small SOHO server with a 4-core i5 and 16 GB of memory). I also tried the low_level variant of this Python binding and - interestingly enough - it performs almost the same as llama.cpp (starts generating words in around 5-6 seconds)!

When I finally tried the web server version (high_level variant) through the /docs API client, I got much worse performance - it usually takes around 25 seconds to start generating anything :( I started tinkering with the parameters and increased the number of threads from 2 to 4, which gave somewhat better results (the wait time is now around 16 seconds, more or less), but that is still about 2x slower than the llama.cpp variant. I also tried to enable mmap (had to increase the limit using the ulimit command), but no luck - still the same (except that memory is now almost full and swap starts eating space). I even tried playing with other parameters, but it seems nothing can speed up this server - which is so weird, since it's supposed to be a web service extension over a Python C++ wrapper (umm, right?).

I didn't have time to dive deeper into the code, but my naive guess is that something fishy in the high_level code makes generation slower than the low_level variant - no idea what, though.

Let me know if you need any other info; I'm very interested in getting this to work since we use LLMs as microservices extensively.

Cheers
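
Since the thread count made a measurable difference above, here is a minimal sketch of pinning it explicitly through the Python API. This assumes the n_threads and use_mlock keyword arguments of the Llama class; the defaults, and the way the bundled server exposes these settings, may differ between versions:

from llama_cpp import Llama

# Pin the number of CPU threads instead of relying on the default, and ask the OS
# to keep the model resident in RAM (needs a sufficient memlock ulimit).
llm = Llama(
    model_path="./models/7B/ggml-llama-7b-q4.bin",
    n_threads=4,
    use_mlock=True,
    verbose=True,
)

output = llm("Q: What is the capital of France? A: ", max_tokens=16, stop=["Q:", "\n"])
print(output["choices"][0]["text"])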

@abetlen (Owner)

abetlen commented Apr 10, 2023

@bratislav I'll definitely look into this - can you provide some minimal examples? Also, there are some high-level API examples; do you mind confirming whether the issue is there or in the server? There may be some weird asyncio bugs in the server, so I'm just trying to figure out where this could be coming from.

I also provided a small example showing how you can generate a flamegraph to profile where the majority of the time is being spent; that might help us track down whether it's a Python issue or comes from different parameters / settings in llama.cpp: #51 (comment)

@abetlen abetlen pinned this issue Apr 10, 2023
@abetlen (Owner)

abetlen commented Apr 10, 2023

Here are my runs for the high-level vs. low-level API on my machine, just for reference. I've highlighted the portions that are running in C++ in brown and provided the system information and per-token timings.

Low Level API
(flamegraph and per-token timing screenshots)

High Level API
(flamegraph and per-token timing screenshots)

@MillionthOdin16 (Contributor)

> Update:
> llama.cpp took almost 2.5 hours to respond to my prompt, while the Python binding took 18 minutes. I'm starting to think the problem is not with the library but rather with my computer itself. If any of you have an idea as to why it is so slow on my PC, I'd love to hear it.

Look at your memory and disk utilization. If your RAM is anywhere near 100%, you're going to have bad performance if the system is using swap to compensate for the lack of memory. If your disk utilization goes up as you run the executable, that's a sign that bad things are happening.

If it takes minutes to compute a response then something is definitely wrong with how things are running.

@abetlen abetlen unpinned this issue Apr 11, 2023
@bratislav

bratislav commented Apr 11, 2023

Hi @abetlen, thanks for the quick response.

Here's our SOHO server info:

Ubuntu 20.04.5 LTS
i5
16GB RAM
Python: 3.8.10
gcc: 9.4.0
llama-cpp-python: 0.1.30

I downloaded the alpaca and llama models from the torrent found here, using the magnet link at the top of the page... not sure if the models are the right ones, but here are the files I'm using:

llama-7b-ggml-q4_0/ggml-model-q4_0.bin
alpaca-7b-ggml-q4_0-lora-merged/ggml-model-q4_0.bin

(haven't converted or modified them, I'm just running them)

Okay, so the first run is the regular (latest) llama.cpp, downloaded and built yesterday:

baki@srv-mini:~/llama.cpp$ ./examples/alpaca.sh 
main: seed = 1681203711
llama_model_load: loading model from './models/7B/ggml-alpaca-7b-q4.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 4096
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 32
llama_model_load: n_layer = 32
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 11008
llama_model_load: n_parts = 1
llama_model_load: type    = 1
llama_model_load: ggml map size = 4017.70 MB
llama_model_load: ggml ctx size =  81.25 KB
llama_model_load: mem required  = 5809.78 MB (+ 1026.00 MB per state)
llama_model_load: loading tensors from './models/7B/ggml-alpaca-7b-q4.bin'
llama_model_load: model size =  4017.27 MB / num tensors = 291
llama_init_from_file: kv self size  =  256.00 MB

system_info: n_threads = 3 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
main: interactive mode on.
Reverse prompt: '### Instruction:

'
sampling: temp = 0.200000, top_k = 10000, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.000000
generate: n_ctx = 512, n_batch = 256, n_predict = 128, n_keep = 21


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.

 Below is an instruction that describes a task. Write a response that appropriately completes the request.
> Hi, from now on your name is Bob, a virtual assistant made by Baki.
Hello, I'm Bob.

### Instruction:


> Who made you?
I was made by Baki.

### Instruction:


>

Timings:

Model loading (until first input shows): ~ 6 seconds
After first instruction, response shows after: ~7 seconds
After second instruction, response shows after: ~4 seconds

Second run: I try the low_level Python wrapper around the same llama.cpp version (downloaded into the /vendor dir), on the same machine:

baki@srv-mini:~/llama-cpp-python$ python3 examples/low_level_api/low_level_api_chat_cpp.py --model ~/llama.cpp/models/7B/ggml-llama-7b-q4.bin -r "User:" --interactive
seed = 1681204762
llama.cpp: loading model from /home/baki/llama.cpp/models/7B/ggml-llama-7b-q4.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 2
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 2052.00 MB per state)
llama_init_from_file: kv self size  =  512.00 MB

system_info: n_threads = 4 / 4 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
interactive mode on.
Reverse prompt: 'User:'
sampling: temp = 0.8,top_k = 40,top_p = 0.95,repeat_last_n = 64,repeat_penalty = 1.1

generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.
 - If you want to submit another line, end your input in '\'.


 2014-15 Women's Lacrosse News
Virginia Tech Announces Schedule for its Second Season of Lax
BLACKSBURG, Va. – Virginia Tech women’s lacrosse head coach Shannon Smith announced her team’s 2015 schedule on Thursday morning (December 4). The Hokies open their season at home against Longwood (Jan. 29) before heading to North Carolina for a Feb. 1 matchup with Duke.
Six Women's Lacrosse Players Named All

Not sure if I messed up something with the params, but this example in interactive mode doesn't wait for any input; it just starts spitting out words - and stops after some number of them. Also, if I switch from the chat example to the llama one (low_level_api_llama_cpp.py), it just segfaults:

baki@srv-mini:~/llama-cpp-python$ python3 examples/low_level_api/low_level_api_llama_cpp.py --model ~/llama.cpp/models/7B/ggml-llama-7b-q4.bin
error loading model: failed to open ../models/7B/ggml-model.bin: No such file or directory
llama_init_from_file: failed to load model
Segmentation fault (core dumped)

Timings of the chat example:

Model loading (until first word shows): almost immediately
Number of words per second: around 3 words/sec (Awesome!)

Third run: I try the high_level_api example, as you requested:

baki@srv-mini:~/llama-cpp-python$ python3 examples/high_level_api/high_level_api_inference.py --model ~/llama.cpp/models/7B/ggml-llama-7b-q4.bin
llama.cpp: loading model from /home/baki/llama.cpp/models/7B/ggml-llama-7b-q4.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 2
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 2052.00 MB per state)
llama_init_from_file: kv self size  =  512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:        load time =  2471.73 ms
llama_print_timings:      sample time =    42.13 ms /    48 runs   (    0.88 ms per run)
llama_print_timings: prompt eval time =  5722.36 ms /    19 tokens (  301.18 ms per token)
llama_print_timings:        eval time = 17361.61 ms /    47 runs   (  369.40 ms per run)
llama_print_timings:       total time = 23131.09 ms
{
  "id": "cmpl-c6a253f3-102e-4eec-b888-a9442690e47f",
  "object": "text_completion",
  "created": 1681205386,
  "model": "/home/baki/llama.cpp/models/7B/ggml-llama-7b-q4.bin",
  "choices": [
    {
      "text": "Question: What are the names of the planets in the solar system? Answer: 1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune 9. Pluto 10.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 48,
    "total_tokens": 67
  }
}

Timings:
This took around 25 seconds! (Subsequent runs don't speed things up; I always get a similar time.)

In a final run, I try llama-cpp-python from pip, firing instructions from the /docs web client (the /v1/chat/completions endpoint with the default JSON request). The first instruction's response shows up after around the same time (25 seconds), and if I bump the threads up from 2 to 4 I get the response in around 12-16 seconds. Almost everything I've tried so far to speed this up further has failed. Also, when I build the Python wheel from source, the timings stay the same.

Here's an example:

baki@srv-mini:~/llama.cpp$ python3 -m llama_cpp.server
llama.cpp: loading model from ./models/7B/ggml-alpaca-7b-q4.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 2
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
llama_init_from_file: kv self size  =  256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
INFO:     Started server process [315210]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     192.168.0.12:50344 - "GET / HTTP/1.1" 404 Not Found
INFO:     192.168.0.12:50344 - "GET /docs HTTP/1.1" 200 OK
INFO:     192.168.0.12:50344 - "GET /openapi.json HTTP/1.1" 200 OK

llama_print_timings:        load time =  1376.85 ms
llama_print_timings:      sample time =     9.75 ms /    11 runs   (    0.89 ms per run)
llama_print_timings: prompt eval time = 10696.38 ms /    70 tokens (  152.81 ms per token)
llama_print_timings:        eval time =  2017.02 ms /    10 runs   (  201.70 ms per run)
llama_print_timings:       total time = 12724.85 ms
INFO:     192.168.0.12:50345 - "POST /v1/chat/completions HTTP/1.1" 200 OK

I can run all these examples on our production server, which is a 4-Xeon IBM beast with 128 GB of memory, to see if there's any difference.

TODO:

  • I haven't checked the SHA256 checksums of the model files yet (I hope the files are okay) - a quick way to do this is sketched after this list
  • I will run the py-spy thing later today
  • I wonder if I'm reading the latest timings right, because this web service returns the response only after ALL tokens are generated, not one by one (as in some websocket examples)
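
A quick sketch for the checksum item above, using plain hashlib (the path is just an example):

import hashlib

def sha256sum(path, chunk_size=1 << 20):
    # Stream the file in 1 MB chunks so large model files don't have to fit in memory.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(sha256sum("models/7B/ggml-llama-7b-q4.bin"))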

@bratislav

bratislav commented Apr 11, 2023

@abetlen Here are the outputs of py-spy (using 0.1.30):

Low-level:

baki@srv-mini:~/llama-cpp-python$ py-spy record --native -o profile.svg -- python3 examples/low_level_api/low_level_api_llama_cpp.py -m ~/llama.cpp/models/7B/ggml-llama-7b-q4.bin 
py-spy> Sampling process 100 times a second. Press Control-C to exit.

llama.cpp: loading model from /home/baki/llama.cpp/models/7B/ggml-llama-7b-q4.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 2
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 2052.00 MB per state)
py-spy> 1.15s behind in sampling, results may be inaccurate. Try reducing the sampling rate
llama_init_from_file: kv self size  =  512.00 MB
 

### Instruction:
What is the capital of France?

### Response:
Paris.

### Questions:
Who was the first president of America?


llama_print_timings:        load time =  2942.98 ms
llama_print_timings:      sample time =    19.85 ms /    20 runs   (    0.99 ms per run)
llama_print_timings: prompt eval time =  6159.05 ms /    28 tokens (  219.97 ms per token)
llama_print_timings:        eval time =  4763.31 ms /    19 runs   (  250.70 ms per run)
llama_print_timings:       total time = 12137.12 ms

py-spy> Stopped sampling because process exited
py-spy> Wrote flamegraph data to 'profile.svg'. Samples: 1250 Errors: 0

(flamegraph: profile.svg)

Here it is for the high_level API (took almost 25 seconds!):

baki@srv-mini:~/llama-cpp-python$ py-spy record --native -o profile2.svg -- python3 examples/high_level_api/high_level_api_inference.py -m ~/llama.cpp/models/7B/ggml-llama-7b-q4.bin 
py-spy> Sampling process 100 times a second. Press Control-C to exit.

llama.cpp: loading model from /home/baki/llama.cpp/models/7B/ggml-llama-7b-q4.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 2
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 2052.00 MB per state)
llama_init_from_file: kv self size  =  512.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:        load time =  2629.82 ms
llama_print_timings:      sample time =    45.66 ms /    48 runs   (    0.95 ms per run)
llama_print_timings: prompt eval time =  6087.97 ms /    19 tokens (  320.42 ms per token)
llama_print_timings:        eval time = 18470.45 ms /    47 runs   (  392.99 ms per run)
llama_print_timings:       total time = 24609.50 ms
{
  "id": "cmpl-94bf3241-852a-469e-8710-fd4702e280d4",
  "object": "text_completion",
  "created": 1681212374,
  "model": "/home/baki/llama.cpp/models/7B/ggml-llama-7b-q4.bin",
  "choices": [
    {
      "text": "Question: What are the names of the planets in the solar system? Answer: 1. Mercury 2. Venus 3. Earth 4. Mars 5. Jupiter 6. Saturn 7. Uranus 8. Neptune 9. Pluto 10.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 48,
    "total_tokens": 67
  }
}

(flamegraph: profile2.svg)

@acheong08

Latest version fixed my issues

(flamegraphs: profile.svg, profile2.svg)

@acheong08

The difference seems to be minimal now.

@MartinPJB (Author)

It seems that way to me as well. It now takes just a few seconds instead of the 18 minutes I had before. I don't know what happened, but I am glad it got fixed. Thanks!

@abetlen (Owner)

abetlen commented Apr 11, 2023

It might have just been a regression in the base llama.cpp library, but who knows.

@abetlen abetlen closed this as completed Apr 11, 2023
@bratislav

bratislav commented Apr 12, 2023

@abetlen @MartinPJB @acheong08 Guys, I've updated both llama.cpp and llama-cpp-python and I'm still getting the same timings as before - no change whatsoever :( I'm starting to think my model is wrong. Are you willing to share the link you downloaded the .bin files from (or at least the SHA256 sums so I can compare with mine)?

Did you create the *.bin files from the weights yourselves, or did you download already-made GGML files? Are you using GGML v2, maybe?
I've tried another alpaca bin file I found on the web, and inference in the high_level example is even slower - 44 seconds :(

Thanks

@acheong08

@bratislav The issue was not with the bin files. I tested with the same model throughout. Specifically https://huggingface.co/Pi3141/gpt4-x-alpaca-native-13B-ggml

@acheong08

make sure to cd vendor && rm -rf llama.cpp && git clone https://github.com/ggerganov/llama.cpp/ && cd .. && pip3 install .

@acheong08

This fetches the latest version.

@0xdevalias

0xdevalias commented Apr 14, 2023

> Latest version fixed my issues

> The difference seems to be minimal now.

> It seems that way to me as well. It now takes just a few seconds instead of the 18 minutes I had before. I don't know what happened, but I am glad it got fixed. Thanks!

> It might have just been a regression in the base llama.cpp library, but who knows.

@acheong08 @abetlen @MartinPJB Unsure if the slowness here was related, but there was a bunch of performance regression stuff in the base lib that I was vaguely following that was resolved recently:


> Thank you for this hard work - I missed this regression because I rarely run generations with more than a few tens of tokens. The problem is that the transpose operation for the V matrix is very slow and becomes slower and slower with every new token added.
>
> I think I have provided a fix here: ggerganov/llama.cpp#775
>
> Tested only on M1 so far
>
> Originally posted by @ggerganov in ggerganov/llama.cpp#603 (comment)

> https://github.com/ggerganov/llama.cpp/issues/#775 was merged pretty quickly, cyyynthia if it's not too much trouble, could you update your pretty graphs with that?
>
> @sw The graphs aren't super pretty this time because I didn't take the time to properly close everything and had a bunch of things open in background while the test was running 😅
>
> That being said, the regression appears to be gone. 🎉 Here's the graphs and the raw CSV:
> (graph image and token_times.csv attached)
>
> Originally posted by @cyyynthia in ggerganov/llama.cpp#603 (comment)

> Well I guess that settles it. The Great Wizard Georgi has saved the day! Thanks to @cyyynthia and @KASR for putting in the hard work of tracking this down.
>
> I have opened https://github.com/ggerganov/llama.cpp/issues/#790 to track the discrepancy in the different partial times vs total time. I think this issue could be closed. Thanks everyone.
>
> Originally posted by @sw in ggerganov/llama.cpp#603 (comment)
