
Token generation is extremely slow when using 13B models on an M1 Pro with llama.cpp, but it runs at a fine speed with Dalai (which uses an older version of llama.cpp) #767

Closed
serovar opened this issue Apr 4, 2023 · 30 comments
Labels: performance (Speed related topics)

Comments

serovar commented Apr 4, 2023

Expected Behavior

I can load a 13B model and generate text with it at a decent token generation speed on an M1 Pro CPU (16 GB RAM).

Current Behavior

When I load a 13B model with llama.cpp (such as Alpaca 13B or other models based on it) and try to generate some text, each token takes several seconds to generate, to the point that these models are unusable because of how unbearably slow they are. But they work at a reasonable speed with Dalai, which uses an older version of llama.cpp.

Environment and Context

MacBook Pro with M1 Pro, 16 GB RAM, macOS Ventura 13.3.

Python 3.9.16

GNU Make 3.81

Apple clang version 14.0.3 (clang-1403.0.22.14.1)
Target: arm64-apple-darwin22.4.0
Thread model: posix

If you need any kind of log or other information, I will post everything you need. Thanks in advance.

j-f1 (Collaborator) commented Apr 4, 2023

A couple of things to check:

  • Are you building in release mode? Debug mode would be significantly slower.
  • Which weight type are you using? f16 weights may be slower to run than q4_0 weights.

serovar (Author) commented Apr 4, 2023

I followed the instructions in the repo (simply git clone the repo, cd into the folder, and run make, so I suppose it builds in release mode by default).

I tried with f16 (which I understand are the q4_1 ones) and with q4_0, with similar results.

I can add that if I run a basic command like ./main -m ./models/vicuna-13B/ggml-vicuna-13b-4bit.bin -n 128 the generation speed is fine, but when using the example scripts (with the correct model path, obviously) it becomes unbearably slow.

Video of ./main -m ./models/vicuna-13B/ggml-vicuna-13b-4bit.bin -n 128 :

CleanShot.2023-04-04.at.20.05.20.mp4

Video of the chat-13B example script:

CleanShot.2023-04-04.at.20.07.29.mp4

j-f1 (Collaborator) commented Apr 4, 2023

Hm, I don’t see a chat-13B-vicuna.sh example in the examples folder. Are you sure you’re filing this against the right repo?

serovar (Author) commented Apr 4, 2023

It’s chat-13B.sh but with the path to the vicuna model. I get the exact same results using the default chat-13B.sh with the standard Alpaca 13B model (and with every other example script in that folder).

@x02Sylvie

Probably relevant, #603

j-f1 (Collaborator) commented Apr 4, 2023

My guess is that one or more of the additional options the script is passing to ./main is causing the slowdown — if you start by removing all of them and then adding them back one at a time you should be able to track down which one is causing the slowdown.

@cyyynthia (Collaborator)

The older version used by Dalai (https://github.com/candywrap/llama.cpp) doesn't include the changes pointed out in #603, which appear to have caused a significant performance regression. My assumption is that this is related to what we're investigating over there.

KASR (Contributor) commented Apr 4, 2023

Might be the same issue as #735 and #677, and indeed probably related to #603.

serovar (Author) commented Apr 4, 2023

My guess is that one or more of the additional options the script is passing to ./main is causing the slowdown — if you start by removing all of them and then adding them back one at a time you should be able to track down which one is causing the slowdown.

I tried changing and removing the additional options with no improvement. Moreover, the strangest thing is that now even the simple ./main -m ./models/vicuna-13B/ggml-vicuna-13b-4bit.bin -n 128 command leads to slow token generation (I did not change anything at all).

CleanShot.2023-04-04.at.23.56.29.mp4

j-f1 (Collaborator) commented Apr 4, 2023

I think we should close this out for now since it seems like the performance regression discussion is happening in #603 / #677 / #735

j-f1 closed this as not planned Apr 4, 2023
MillionthOdin16 commented Apr 5, 2023

I don't know that this is specifically the issue I describe in #603. His behavior is different from mine and might be related to memory and swap issues. I've seen problems for some users since the mmap() update, and what he's describing sounds more similar to one of those cases where performance plummets to unbearable levels right from the start.

I didn't even see this issue because it was closed before I got a chance to look at it. The only reason I saw it is that it was referenced in my issue.

@serovar can you look at your disk utilization as it's processing?

serovar (Author) commented Apr 5, 2023

Here:

CleanShot.2023-04-05.at.12.05.58.mp4

MillionthOdin16 commented Apr 5, 2023

Okay, thanks. What does your RAM usage look like? Do you have spare RAM, or is it all allocated?

edit:
I'm not sure about the technical details of what's going on, but it looks like you might be using swap because you're low on RAM. So far I've only seen it confirmed on Windows, but it might be related to the cases where mmap causes poor performance for users. Someone with more experience with the recent updates can probably help narrow it down.

In relation to issue #603, this issue is different, and I think it only started happening for users in the last few days. The reason you don't have the issue with Dalai is that it doesn't have some of the more recent updates from this repo.

j-f1 reopened this Apr 5, 2023
serovar (Author) commented Apr 5, 2023

It does not seem like the performance is directly correlated with RAM allocation. I installed htop to get a complete view of the process, and I managed to record two different sessions (one with decent speed and one very slow) with the same chat script.

Here is the fast one:

CleanShot.2023-04-05.at.13.00.20.mp4

Here is the slow one:

CleanShot.2023-04-05.at.14.13.55.mp4

@MillionthOdin16

Okay, so just to be clear: you're running the exact same command, and sometimes the generation speed is horrible while other times it generates normally?

One thing I noticed is that in the fast generation video it looked like your system uptime was about 6 minutes, while in the slow example your uptime was much longer.

This may seem odd, but after restarting your computer and running the model for the first time, do you have faster generation?

oldsj commented Apr 5, 2023

Can confirm I had this exact problem on an M1 Pro with 16 GB RAM, and rebooting fixed the issue 😄

MillionthOdin16 commented Apr 6, 2023 via email

oldsj commented Apr 6, 2023

Aaaaand after loading Chrome and doing some other stuff it's now back to being extremely slow. Also, I'm running this exact setup on an M1 Max with 64 GB RAM and not seeing the issue there.

It doesn't seem to be spiking CPU or RAM usage, though it's reading from disk at ~830MB/s while trying to respond

MillionthOdin16 commented Apr 6, 2023 via email

oldsj commented Apr 6, 2023

Ah, adding --mlock actually seems to fix the issue! At the cost of a slightly longer initial load.

@vpellegrain

Ah, adding --mlock actually seems to fix the issue! At the cost of a slightly longer initial load.

Same behavior here, thanks!

serovar (Author) commented Apr 6, 2023

I can confirm that I have the same behavior as @oldsj and that adding --mlock does fix the issue.

Could I ask what this option does? I read about it in this discussion, but it is not very clear to me, and they also mention it needing root permissions (which I did not grant).
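For readers wondering the same thing: llama.cpp memory-maps the model file, and --mlock additionally asks the OS to lock those pages in RAM so they cannot be paged back out to disk under memory pressure. Below is a minimal POSIX sketch of that idea only (it is not llama.cpp's actual loading code, and the file name is a placeholder). Whether locking a multi-gigabyte region succeeds depends on the locked-memory limit (RLIMIT_MEMLOCK) rather than necessarily on being root, and if the call fails the model still runs, just without the pinning.

    // Illustrative sketch: map a model file read-only, then pin it with mlock().
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        const char * path = "ggml-model-q4_0.bin";   // placeholder model file
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        // Pages of the mapping are loaded lazily on first access and may be
        // evicted later, which shows up as heavy disk reads during generation.
        void * addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (addr == MAP_FAILED) { perror("mmap"); return 1; }

        // mlock() keeps the mapped pages resident so the kernel will not evict
        // them. It can fail if RLIMIT_MEMLOCK is lower than the model size.
        if (mlock(addr, st.st_size) != 0) {
            perror("mlock (weights still usable, but they may be paged out)");
        }

        // ... run inference against the mapped weights here ...

        munlock(addr, st.st_size);
        munmap(addr, st.st_size);
        close(fd);
        return 0;
    }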

cbrendanprice commented Apr 9, 2023

Specifying --mlock did not fix the issue for me. I should preface this by saying that I'm not using Apple silicon, but I did experience poor performance on the 13B model compared with Dalai, as the OP described.

Oddly, the only thing that ended up working for me was explicitly setting the number of threads to a substantially lower number than what is available on my system. Anecdotally, I got the best performance when specifying a thread count that is 1/4 of my available core count.

For context, I have an i9-12900K processor that has 24 virtual cores available. When running with all 24 virtual cores it's basically unusable; each token takes many, many seconds to generate. This continues to be the case until I set the thread count to about 3/4 (16) of my available cores, but even then there are intermittent pauses where nothing happens for several seconds. Only when I get down to around half of my available cores (12) does it start to perform nominally, and it seemingly improves further going down to 1/4 of my available cores.

Hope this insight helps someone.
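A plausible explanation for this pattern: ggml's worker threads are compute-bound and spin-wait between layers, so once -t exceeds the number of physical cores that can actually do the matrix math, hyper-threaded siblings (and, on hybrid chips, efficiency cores) mostly contend for the same execution units and memory bandwidth. Standard C++ only reports logical cores, so any automatic default is a guess; here is a small sketch of that kind of heuristic (the divide-by-two is an assumption to discount SMT, not a rule, and measuring as described above is still the right answer):

    // Heuristic sketch: derive a conservative starting value for -t.
    #include <algorithm>
    #include <cstdio>
    #include <thread>

    int main() {
        // hardware_concurrency() counts logical CPUs (e.g. 24 on an i9-12900K).
        unsigned logical = std::thread::hardware_concurrency();
        // Halving is a rough way to strip SMT siblings; on hybrid CPUs the
        // number of performance cores (8 on a 12900K) may be lower still.
        unsigned guess = std::max(1u, logical / 2);
        std::printf("logical CPUs: %u, starting point for -t: %u\n", logical, guess);
        return 0;
    }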

hanss0n commented Apr 9, 2023

I'm experiencing similar issues using llama.cpp with the 13B model on Ubuntu 22.10. Token generation is initially fast, but becomes unbearably slow as more tokens are generated.

Here's the code snippet I'm using; I'm not a C++ programmer and haven't worked with LLMs in ages, though, so the error might lie elsewhere:

    llama_context_params params = llama_context_default_params();
    ctx = llama_init_from_file(model_path.c_str(), params);
    if (!ctx)
    {
        throw std::runtime_error("Failed to initialize the llama model from file: " + model_path);
    }

    // Tokenize the prompt into a context-sized buffer, then shrink it to fit.
    std::vector<llama_token> tokens(llama_n_ctx(ctx));
    int token_count = llama_tokenize(ctx, input.c_str(), tokens.data(), tokens.size(), true);
    if (token_count < 0) {
        throw std::runtime_error("Failed to tokenize the input text.");
    }
    tokens.resize(token_count);

    int n_predict = 50; // Number of tokens to generate
    std::string output_str;

    for (int i = 0; i < n_predict; ++i) {
        // NOTE: n_past is always 0 and the full token list is passed, so every
        // iteration re-evaluates the entire history from scratch with 8 threads.
        int result = llama_eval(ctx, tokens.data(), token_count, 0, 8);
        if (result != 0) {
            throw std::runtime_error("Failed to run llama inference.");
        }

        // Sample the next token (top_k = 40, top_p = 0.9, temp = 1.0, no repeat penalty).
        llama_token top_token = llama_sample_top_p_top_k(ctx, tokens.data(), token_count, 40, 0.9f, 1.0f, 1.0f);
        const char *output_token_str = llama_token_to_str(ctx, top_token);

        output_str += std::string(output_token_str);
        std::cout << output_str << std::endl; // prints the whole accumulated string each step

        // Update the context with the generated token
        tokens.push_back(top_token);
        token_count++;
    }

    return output_str;

Edit: Turns out I can't code. Works fine now, just had to get rid of the first token in each iteration.
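For anyone who hits the same "gets slower with every token" pattern: in the snippet above, llama_eval is called each iteration with the full token list and n_past = 0, so the whole history is re-evaluated from scratch on every step instead of reusing the cached context. A minimal sketch of the incremental pattern, reusing ctx, tokens, token_count, n_predict, and output_str from the snippet above and assuming the same April-2023 C API (llama_eval / llama_sample_top_p_top_k):

    // Sketch: evaluate the prompt once, then feed back one token at a time.
    const int n_threads = 8;

    // Prompt evaluation fills the model's context cache up to token_count.
    if (llama_eval(ctx, tokens.data(), token_count, /*n_past=*/0, n_threads) != 0) {
        throw std::runtime_error("Failed to evaluate the prompt.");
    }
    int n_past = token_count;

    for (int i = 0; i < n_predict; ++i) {
        // Sample from the logits produced by the most recent llama_eval call.
        llama_token tok = llama_sample_top_p_top_k(ctx, tokens.data(), (int) tokens.size(),
                                                   40, 0.9f, 1.0f, 1.0f);
        output_str += llama_token_to_str(ctx, tok);
        tokens.push_back(tok);

        // Only the newly sampled token is evaluated; earlier tokens stay cached.
        if (llama_eval(ctx, &tokens.back(), 1, n_past, n_threads) != 0) {
            throw std::runtime_error("Failed to run llama inference.");
        }
        ++n_past;
    }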

ssuukk commented Apr 10, 2023

I'm having a similar experience with the following command:

main -i --threads 12 --interactive-first -r "### Human:" --temp 0 -c 2048 -n -1 --ignore-eos --repeat_penalty 1.2 --instruct -m ggml-model-q4_1.bin

It's a Ryzen 9 with 12 cores. Each token takes at least 2 seconds to appear.

@HanClinto (Collaborator)

@ssuukk Does adding --mlock help in your situation, or no?

MillionthOdin16 commented Apr 11, 2023 via email

ssuukk commented Apr 11, 2023

@ssuukk Does adding --mlock help in your situation, or no?

With --mlock it is as slow as without it, maybe even slower: now it takes 2 seconds to generate parts of words!

kiratp commented Apr 30, 2023

Here is the scaling on an M1 Max (7B int4, maxed-out GPU, 64 GB RAM):

(image: thread-count scaling chart)

kiratp commented Apr 30, 2023

Oddly, the only thing that ended up working for me was explicitly setting the number of threads to a substantially lower number than what is available on my system. Anecdotally, I got the best performance when specifying a thread count that is 1/4 of my available core count.

For context, I have an i9-12900K processor that has 24 virtual cores available. When running with all 24 virtual cores it's basically unusable; each token takes many, many seconds to generate. This continues to be the case until I set the thread count to about 3/4 (16) of my available cores, but even then there are intermittent pauses where nothing happens for several seconds. Only when I get down to around half of my available cores (12) does it start to perform nominally, and it seemingly improves further going down to 1/4 of my available cores.

You should be setting -t to the number of P-cores in your system. Your system has 8P + 8E IIRC (8*2 + 8 = 24 logical cores), so set -t 8.

You can modify this script to measure the scaling on your machine - https://gist.github.com/KASR/dc3dd7f920f57013486583af7e3725f1#file-benchmark_threads_llama_cpp-py
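If you would rather not set up the Python script, here is a rough C++ stand-in for the same idea: run ./main at a few thread counts and compare wall-clock time. The model path, prompt, and thread values below are placeholders to adjust for your own machine.

    // Crude thread-scaling harness: shells out to ./main with different -t values.
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <string>

    int main() {
        // Placeholder command; point it at your own model and prompt.
        const std::string base =
            "./main -m ./models/7B/ggml-model-q4_0.bin -p \"Hello\" -n 64 -t ";

        for (int t : {4, 6, 8, 12, 16, 24}) {
            const auto start = std::chrono::steady_clock::now();
            int rc = std::system((base + std::to_string(t) + " > /dev/null 2>&1").c_str());
            const double secs = std::chrono::duration<double>(
                std::chrono::steady_clock::now() - start).count();
            std::printf("-t %2d -> exit status %d, %.1f s total\n", t, rc, secs);
        }
        return 0;
    }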
