server : add speculative decoding support #10455
Conversation
Force-pushed from 1973399 to 7dc6ae5
From what I have read, the goal is faster inference while retaining the quality of the larger model. I am using an RX 6900 XT with Vulkan; I get about 10-12 t/s with an incorrect configuration.
Flipping the models increased speed and the output looks similar. This makes sense, since -md specifies the draft model, which is supposed to be the smaller model. I get about 16 t/s with the correct configuration.
With a lower context of 2048, the server crashed when the limit was reached.
Force-pushed from c5ddee2 to e80f758
@3Simplex What is the output of the following bench on your machine: llama-bench.exe -m "...Qwen2.5-Coder-7B-Instruct-Q8_0.gguf" -p 1,1,2,3,4,5,6,7,8,12,16,32 -r 20 -n 0 -ngl 99 -fa 1 |
.\llama-bench.exe -m "...\Qwen2.5-Coder-7B-Instruct-Q8_0.gguf" -p 1,1,2,3,4,5,6,7,8,12,16,32 -r 20 -n 0 -ngl 99 -fa 1
build: 0c74590 (4160) |
I tried out commit e80f758 with my P40s, 3xP40s and 3090. These are the commands for the baselines and the tests. Baseline:
With speculative model (just removed the
Tested it with curl using:
Data:
Force-pushed from e80f758 to d905266
Currently, it requires
The biggest benefit from speculative sampling is when you have more grounding. For example, if you have enough memory for a bigger context, you can try something like this:
# get the llama.vim plugin source code
code=$(curl -s https://raw.githubusercontent.com/ggml-org/llama.vim/refs/heads/master/autoload/llama.vim | jq -sRr @json)
# ask qwen to implement something (speculative decoding disabled)
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg code "$code" \
'{ messages: [{ role: "system", content: "You are an expert computer scientist. Respond only with code blocks. Do not add any other comments except code." }, { role: "user", content: "Suggest an improvement for the `chunk_sim` function using Levenstein distance: ```\($code)```" }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 0 }')" | jq -r .choices[0].message.content
# speculative decoding enabled
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg code "$code" \
'{ messages: [{ role: "system", content: "You are an expert computer scientist. Respond only with code blocks. Do not add any other comments except code." }, { role: "user", content: "Suggest an improvement for the `chunk_sim` function using Levenstein distance: ```\($code)```" }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 16 }')" | jq -r .choices[0].message.content
With CUDA, you might want to try setting
Thank you for the guidance. Using d905266, I reran the tests. Results look quite good.
Server command:
Kept this pretty consistent, except for the 3xP40 run where I added
Client side:
For the client-side curl, I changed
Here are the raw results. Some observations first:
3090 data
single P40
3xP40 (-sm row)
Code generated:
function! s:chunk_sim(c0, c1)
let l:lines0 = join(a:c0, "\n")
let l:lines1 = join(a:c1, "\n")
let l:distance = levenshtein(l:lines0, l:lines1)
return 1 - (l:distance / max([strlen(l:lines0), strlen(l:lines1)]))
endfunction
function! levenshtein(s1, s2)
let l:len1 = strlen(a:s1)
let l:len2 = strlen(a:s2)
if l:len1 == 0
return l:len2
endif
if l:len2 == 0
return l:len1
endif
let l:dp = []
for i in range(l:len1 + 1)
call add(l:dp, [])
for j in range(l:len2 + 1)
call add(l:dp[i], 0)
endfor
endfor
for i in range(l:len1 + 1)
let l:dp[i][0] = i
endfor
for j in range(l:len2 + 1)
let l:dp[0][j] = j
endfor
for i in range(1, l:len1 + 1)
for j in range(1, l:len2 + 1)
let l:cost = (strcharpart(a:s1, i - 1, 1) == strcharpart(a:s2, j - 1, 1)) ? 0 : 1
let l:dp[i][j] = min([l:dp[i - 1][j] + 1, l:dp[i][j - 1] + 1, l:dp[i - 1][j - 1] + l:cost])
endfor
endfor
return l:dp[l:len1][l:len2]
endfunction |
Also, is |
Thanks for the detailed tests. The results are inflated because there is one tricky side effect from the caching - consecutive runs with the same prompt will reuse the previous draft context which combined with greedy sampling would make the drafting instantaneous. So basically, in the following data for example, only the first result is relevant:
i.e.
This was a bug - it is fixed now. You should be able to change
Btw, here is another fun test that I came up with which uses less context and is suitable for speculation:
# get top 10 stories from Hacker News
hn=$(curl -s https://hacker-news.firebaseio.com/v0/topstories.json | jq -r '.[:10] | @tsv' | tr '\t' '\n' | xargs -I {} curl -s "https://hacker-news.firebaseio.com/v0/item/{}.json" | jq -sRr @json)
# make a Markdown table based on some criteria
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n --arg hn "$hn" \
'{ messages: [{ role: "system", content: "You are a helpful text-editing assistant. Respond only with the requested text. Do not add any other comments to your response." }, { role: "user", content: "Extract a Markdown table that contains only stories about software engineering, AI or machine learning from the front-page of HN. The table should include: author, title, score, comments and an URL to the story: ```\($hn)```." }], cache_prompt: true, top_k: 1, samplers: ["top_k"], "speculative.n_max": 16 }')" | jq -r .choices[0].message.content |
Thanks. That seems a lot more realistic. I did some tests with a much shorter prompt: "write snake game in swift"
These numbers look reasonable. The speedup can vary in both ways based on the inputs, but enabling speculative should almost never result in slower than normal decoding. |
With this build I am up to 25 t/s on first-run generation with speculative decoding, using 15/5 draft tokens.
A bit of data with llama-3.1 70B and llama-3.2 1B as the draft model. Prompt: "write a story about the natural resources in Canada".
Server:
client (changed speculative.n_max between
Note that I am not very sure what happens with multiple GPUs, but it is possible that the draft model gets split across them, which is not desired (see the logs if that is the case). You would want to keep the draft model fully on one GPU. |
Force-pushed from c277c4d to 156aa6d
I wonder if it is possible to load the draft and main models onto different backends, i.e. a 7900 XTX and a P40 in a single -cb process.
Could someone help me out? I'm trying to figure out where I'm going wrong. I have an M4 Pro with 64 GB of memory, and when I use the 32B Qwen models (both the regular and coder versions) with llama.cpp, I usually get about 11 tokens per second. I'm trying to see if I can boost the speed by using speculative decoding, but I haven't had much luck so far. For instance, when I run the following command:
llama-speculative -m $HOME/.cache/lm-studio/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -p "write a ruby script to count the files in a directory recursively" -ngl 1000 -ngld 1000 -fa -md $HOME/.cache/lm-studio/models/bartowski/Qwen2.5-Coder-3B-GGUF/Qwen2.5-Coder-3B-Q4_0.gguf --top-k 1 --draft-max 16 --draft-min 5
I get this output:
There's no noticeable speed improvement. I also tried running the server with:
llama-server -m $HOME/.cache/lm-studio/models/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF/Qwen2.5-Coder-32B-Instruct-Q4_K_L.gguf -ngl 99 -ngld 99 -fa -md $HOME/.cache/lm-studio/models/bartowski/Qwen2.5-Coder-3B-GGUF/Qwen2.5-Coder-3B-Q4_0.gguf --top-k 1 --draft-max 16 --draft-min 5 --port 8033
But I still see the same token speed and no improvement. What am I missing here?
I wonder if the default p-min of 0.9 is too high. I can get a further 20-30% speedup by setting a lower
GPU: RTX 4060 Ti 16GB
@PkmX The p-min = 0.9 is very conservative. The idea is to enable the speculation only for blocks of tokens where the LLM is very confident. With CUDA, it might be better to reduce p-min and also n-min. Feel free to experiment. |
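For reference, a minimal sketch of what such an experiment could look like; the model paths and the chosen values are placeholders, and the flags are the same ones used elsewhere in this thread (the default --draft-p-min is the 0.9 discussed above).

```bash
# lower the draft acceptance threshold and the minimum draft length,
# then compare tokens/s against a run with the default settings
llama-server \
    -m  models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
    -md models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf \
    -ngl 99 -ngld 99 -fa \
    --draft-max 16 --draft-min 2 --draft-p-min 0.4 \
    --port 8033
```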
Why am I getting a consistent 60 tokens/sec with llama-speculative while only 40 tokens/s through llama-server? Using the following two commands:
llama-speculative:
llama-server:
And then querying the exact same prompt through Open WebUI, with temperature set to 0 and top-k set to 1. Is there anything that can explain this rather big discrepancy?
llama-speculative:
Quick update: Dropping p-min increased the tokens/second for llama-server. I maxed out the speed at 53 tokens/second at a p-min of 0.4, and it remained at 53 tokens/second all the way down to 0. Two questions that come to mind:
Update 2: Managed to obtain the following result:
This was obtained through the following command:
Over 60 tokens/second on a single 7900 XTX! What a time to be alive :) Thank you so much for all your hard work @ggerganov! Still very curious why I need different settings between llama-speculative and llama-server, but at least I am extremely happy I was able to fully unlock the potential of my 7900 XTX.
@Mushoz |
Hi, I got it running on an NVIDIA GeForce RTX 4070 Ti SUPER:
llama-server `
--model './models/Qwen2.5-Coder-32B-Instruct.IQ3_XXS.gguf' `
--ctx-size 8192 `
--threads 16 `
--n-gpu-layers 99 `
--flash-attn `
--top-k 1 `
--temp 0.1 `
--model-draft './models/Qwen2.5-Coder-0.5B-Instruct.IQ4_XS.gguf' `
--ctx-size-draft 8192 `
--n-gpu-layers-draft 99 `
--draft-p-min 0.5 `
--draft-min 5 `
--draft-max 16

llama-server `
--model './models/Qwen2.5-Coder-32B-Instruct.IQ3_XXS.gguf' `
--ctx-size 8192 `
--threads 16 `
--n-gpu-layers 99 `
--flash-attn `
--top-k 1 `
--temp 0.1
@ggerganov I could reproduce the findings of @PkmX that a |
Thanks for the feedback. Does |
@ggerganov I did some tests with
Device:
llama-server `
--model './models/Qwen2.5-Coder-32B-Instruct.IQ3_XXS.gguf' `
--ctx-size 8192 `
--threads 16 `
--n-gpu-layers 99 `
--flash-attn `
--top-k 1 `
--temp 0.1 `
--model-draft './models/Qwen2.5-Coder-0.5B-Instruct.IQ4_XS.gguf' `
--ctx-size-draft 8192 `
--n-gpu-layers-draft 99 `
--draft-p-min 0.5 `
--draft-min 3 `
--draft-max 16
Roughly doubling the t/s in optimal use cases is a very respectable speed bump! And falling back to nearly the same performance as without speculative decoding in worst-case scenarios is good to see.
Performance drop of speculative decoding with cache quantization
Combining speculative decoding with a
I was not expecting the cache quantization to have such a drastic impact. Is this to be expected, or a bug?
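For context, here is a sketch of the kind of A/B comparison being described, assuming the usual -ctk/-ctv cache-type flags; the model paths and draft settings are placeholders.

```bash
# speculative decoding with the default f16 KV cache
llama-server -m main.gguf -md draft.gguf -ngl 99 -ngld 99 -fa \
    --draft-max 16 --draft-min 3 --draft-p-min 0.5 --port 8033

# same setup with a quantized KV cache (V-cache quantization generally needs flash attention)
llama-server -m main.gguf -md draft.gguf -ngl 99 -ngld 99 -fa \
    -ctk q8_0 -ctv q8_0 \
    --draft-max 16 --draft-min 3 --draft-p-min 0.5 --port 8033
```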
I reported the same earlier here: I experienced the KV cache overflowing into shared GPU memory although GPU memory still had room. For now I'm only using flash attention and not KV cache quantization, with a similar +100% speed bump in optimal cases.
Hello. However, I noticed that using speculative decoding with only the CPU can slow things down. Does speculative decoding require a GPU?
I am curious about what numbers you see with/without speculative decoding on CPU. Please include details.
@dagbdagb
repository
CPU: AMD Ryzen 9 7940HS w/ Radeon 780M Graphics
Memory: 32.0 GB total, 27.8 GB available (4 GB for the iGPU)
Task: extract specific data from around 1500 tokens of text in Japanese (repeated 26 times)
(1) b4219 (llama.cpp official binary): 5764.07 seconds / 4968.42 seconds
(2) locally built myself (b4227): 5807.13 seconds / 5003.03 seconds
(3) ROCm (b4215): 1576.67 seconds
I feel that the 2B model may not be able to run fast enough on the CPU, causing a bottleneck.
Tried it with Qwen-2.5 on my 2x 3090s. No performance improvements whatsoever with 72B split across both GPUs; I actually lost some performance. I ran a bunch of experiments using different hints I picked up here. Still no performance gains, whether running the 14B variant with the draft models (tried 0.5B, 1.5B, 3B) on the same GPU or on the other GPU, with any permutation of draft-p-min and speculative.n_max. The best I got was 2-3 tps more (around 56 tps) compared to the ~54 I'm getting when running without the draft model.
Maybe I'm missing something obvious, but I have no clue how other folks got such huge performance gains.
I tested b4240 and got a 150% speed bump, which must be an optimal use case, on my non-optimal hardware (16GB P5000 + 8GB RTX 2070 Max-Q). These tweaks contributed to that: draft-min 0, draft-p-min 0.5 and temperature 0.1.
@mybyte Did you remember to set |
@JeroenAdam This should be fixed now (#10586) |
Yup. Tried both, UI and my app via REST. I wonder if a potential culprit is the model choice? |
First - wow; this is some upgrade. Hats off. So I "mixed" two versions of the same model (llama3, 8B), one regular quant, one imatrix quant (draft): `` ``
Using a custom sampler profile (includes use of DRY, temp .84, and penalty samplers)... here is part of the output from a ONE LINE prompt:
_I stood frozen at my desk, staring into the abyss of the office park outside our high-rise building. My fingers were still poised above the keyboard, the cursor blinking mockingly at me like some malignant eye. The air was thick with tension, the kind that precedes an announcement no one wants to hear. But this was different. This was personal. As I gazed out over the sea of cubicles and fluorescent lights, something began to rumble beneath my feet, like thunder growing nearer. A low, ominous hum filled my ears, making my teeth ache. Suddenly, the sky scraper swayed, as if buffeted by a storm force wind. I grabbed onto the edge of my desk for support, my heart racing with alarm. My coworkers gasped and screamed as we all stumbled about, our chairs scraping against tile floors. Glass shattered somewhere nearby, and a panicked shout rose up from below. It felt like a seismic event, a catastrophic shift in the earth's crust that had nothing to do with geology. Then came the blast._
You don't get this level of detail at 8B. Or prose, or metaphors. This is like a MOE on steroids. A lot of steroids.
@David-AU-github I don't think you quite understand how speculative decoding works. It will generate identical results to the non-speculative decoding case and will always generate what the main model would have generated on its own. It's only useful as a speed boost, it will not alter the output at all. |
@Mushoz Also - what about the number of "drafts"? There is great variance in output here. I also tested this model - both main and draft - separately to see if I could replicate this level of detail. Note: the two models - even though they are the same model - are an imatrix version and a non-imatrix version. In static tests (temp=0) each model will output different content from the same prompt.
Normally a model can only predict one token at a time, because the token at position N depends on all previous tokens 0 through N-1. It would be much quicker if a model could predict not only the token at position N, but also (depending on the number of drafts) N+1, N+2, N+3. The reason this is much faster is that all the weight data of the big, slow model only needs to be retrieved once for all 4 tokens, and LLMs are generally memory-bandwidth limited. More calculations need to be done, but GPUs are extremely good at parallel computation, which is what this is. But this cannot normally be done, because you need all previous tokens to be able to generate the next.

What the draft model does is generate a sequence of draft tokens N, N+1, N+2. The big model then assumes these to be true and generates 1 token ahead of each of these draft tokens, so it can do multiple at the same time. That means that despite the draft model generating N, N+1, N+2, the big model still generates these as well to verify them, but it is able to do so in parallel (fast) instead of in sequence as in normal generation. If the base model generates a different token than what the draft model predicted, all subsequent tokens are discarded and the drafting starts all over again. This means that tokens are only retained if the draft model predicted the token the base model generated, which is why the output with speculative decoding is identical to what the base model would have generated without it. And this is also why a speedup is only observed if the predictions are good enough, because if not, all the extra work is simply discarded.
Thank you. To clarify: you get a speed increase if the vast majority of draft sequence tokens are "in agreement" between the draft and main model. The "draft" min/max is the number of tokens to generate per sequence, i.e. the min/max size of the sequence? If the draft sequence/token(s) are not in agreement, do both models "go back to the drawing board" and "redraft" a sequence of tokens? If that is true - specifically that both models do - that explains what I am observing and I can work with that. It sounds like when I am, err... using speculative decoding in this way, it is forcing different choices to occur than would otherwise happen. Almost like a strange version of temp and/or a "light" rep pen sampler? Or like adding a random element into the generation? I have tested this method with other models/archs too and am observing an increase in generational quality, with a decrease in t/s.
The maximum number of tokens to draft is just that: how long a sequence the draft model will draft. The higher this value, the higher the potential speed increase (up to a maximum, where you become compute bound), as long as the predictions are correct. For longer sequences, the draft model will almost certainly generate something different from the main model at some point, so there is a sweet spot somewhere. Too high and you're just wasting work that gets discarded, leading to slowdowns. The draft min setting tunes how long your draft sequence needs to be at a minimum before the main model uses the draft predictions for the final predictions. Some GPUs might not be terribly efficient at certain batch sizes, so it might be better to force them to higher batch sizes where the kernels are better optimized for batch processing.

When the main model is in disagreement, all draft tokens are discarded, and all tokens generated by the main model that were BASED ON THE INCORRECT DRAFT TOKEN(S) are discarded as well. Importantly, the token that was generated by the main model that proved the draft wrong is NOT discarded, and the main model essentially falls back to the normal non-speculative decoding case. Again, the main model will generate the exact same tokens with speculative decoding on vs off. The differences you are observing are purely due to sampler settings. Speculative decoding does not alter the output in any way, and anything you believe you are seeing is merely placebo.
Excellent, thank you. RE: how long does the model fall back into "normal non-speculative decoding" operation? Until the next sequence of draft tokens from the draft model?
It doesn't really fall back in the literal sense. What I mean is that the draft tokens that were generated incorrectly are simply ignored as if speculative decoding had never been turned on in the first place. Speculative decoding will remain effective in the sense that the draft model will immediately generate a new draft sequence after getting corrected by the main model and the main model will then again use that sequence to do the validation, just as it had been doing before. |
Note that this only applies to the incorrect draft token itself and all subsequent tokens (as they are based on an incorrect preceding token). All correct draft tokens before the incorrect one are retained, of course. If a draft sequence is 16 tokens long, it's perfectly possible that the first 8 tokens are correct (which are retained) and the 9th is incorrect, which means tokens 9 through 16 of the sequence are discarded.
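To put rough numbers on the explanation above (a back-of-the-envelope sketch; the symbols K, a, t_d and t_t are introduced here only for illustration, and verifying the whole draft batch is assumed to cost about one regular forward pass of the target model, per the memory-bandwidth argument above):

$$\text{speedup} \approx \frac{(a + 1)\, t_t}{K\, t_d + t_t}$$

where K is the draft length, a is the average number of accepted draft tokens per verification pass, and t_d, t_t are the per-step times of the draft and target models. For the example above (K = 16, a = 8) with a draft model roughly 10x cheaper per step (t_d ≈ 0.1 t_t), this gives about 9/2.6 ≈ 3.5x; with only one or two accepted tokens the ratio falls to around or below 1, which is consistent with the earlier point that poor drafts just produce work that gets discarded.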
@Mushoz Thank you for your help with this. There is an interesting divergence for creative use with very low-bit quants vs. mid/high ones which may be a benefit (this is separate and apart from spec decoding). Hmmm. Never mind two different models altogether (with the same vocab)... hmmm 2x. MOEs... raise even more questions.
target #10362
Initial implementation that enables speculative decoding in llama-server. Test with this command:
- --draft-max and --draft-min might need tuning
- llama.cpp Web UI client: Top K = 1
- when using multiple GPUs, the -devd argument can put the draft model on only one of them (llama : accept a list of devices to use to offload a model #10497)
Feedback is appreciated.
TODO:
- rename server.params to something else to avoid confusion
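For the multi-GPU case mentioned above, a minimal sketch of pinning the draft model to a single device with -devd; the device name CUDA0 is only an example (check the server's device listing for the actual names on a given system), and the model paths are placeholders.

```bash
# main model is split across the available GPUs as usual,
# while the draft model is kept entirely on one device
llama-server \
    -m  models/Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf \
    -md models/Qwen2.5-Coder-0.5B-Instruct-Q8_0.gguf \
    -ngl 99 -ngld 99 -fa \
    -devd CUDA0 \
    --draft-max 16 --draft-min 5 --port 8033
```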