Avoid heavy V transpose operation + improvements #775

Merged: 1 commit into master from fix-cpy on Apr 5, 2023

Conversation

@ggerganov (Owner) commented Apr 5, 2023

Human-generated notes

I believe this should resolve: #603 #677 #767

When I deprecated the non-contiguous ggml_mul_mat() branch in #439, I grossly underestimated the cost of transposing the V matrix on every token. It's a very heavy memory operation, with tons of cache misses.

To solve this, we now store V in the KV cache in a transposed state, so we don't need to transpose it again for future tokens.
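
To see why this matters, here is a minimal, self-contained C sketch of the two strategies (an illustration only, not the actual llama.cpp code; the sizes and names are assumptions): the old approach rebuilds a transposed copy of the entire V cache on every token, while the new approach writes each token's values straight into a transposed layout once.

```c
// Illustration only (not llama.cpp code): why storing V transposed in the
// cache avoids a heavy per-token transpose. N_CTX / N_EMBD are assumed sizes.
#include <string.h>

#define N_CTX  512   // max cached tokens (assumed)
#define N_EMBD 4096  // embedding size (assumed)

// Old approach: V stored row-major as [token][embd]; every evaluation rebuilds
// V^T over all cached tokens -> O(n_past * n_embd) cache-unfriendly reads.
static void append_then_transpose(float v_cache[N_CTX][N_EMBD],
                                  float v_trans[N_EMBD][N_CTX],
                                  const float *v_cur, int n_past) {
    memcpy(v_cache[n_past], v_cur, N_EMBD * sizeof(float));
    for (int e = 0; e < N_EMBD; e++) {        // full transpose, every token
        for (int t = 0; t <= n_past; t++) {
            v_trans[e][t] = v_cache[t][e];
        }
    }
}

// New approach: keep the cache already transposed as [embd][token]; each new
// token is scattered into its column once, and attention reads v_trans as-is.
static void append_transposed(float v_trans[N_EMBD][N_CTX],
                              const float *v_cur, int n_past) {
    for (int e = 0; e < N_EMBD; e++) {
        v_trans[e][n_past] = v_cur[e];        // one strided write per token
    }
}
```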

ggml :

  • added ggml_view_3d() (see the view/stride sketch after these lists)
  • ggml_view_tensor() now inherits the stride too
  • reimplement ggml_cpy() to account for dst stride
  • no longer require tensor->data to be memory aligned

llama :

  • compute RoPE on 32-bit tensors (should be more accurate)
  • store RoPE-ed K in the KV cache
  • store transposed V in the KV cache (significant speed-up)
  • avoid unnecessary Q copy
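
The ggml-side items revolve around views: tensors that share the KV cache buffer and are described by element counts (ne) and byte strides (nb) rather than owning contiguous memory. A rough, self-contained sketch of the idea follows (an illustrative struct and helper under assumed names, not ggml's actual implementation); it shows why a view has to carry the source's strides and why a copy into a view must step by the destination's strides instead of assuming a contiguous layout.

```c
// Illustration only (assumed names, not ggml's real structs): a 2-D strided
// view over a shared buffer, and a copy that honors the destination strides.
#include <stdint.h>
#include <string.h>

struct view2d {
    float  *data;   // shared with the source tensor's buffer
    int64_t ne[2];  // number of elements in each dimension
    size_t  nb[2];  // byte stride to advance one element in each dimension
};

// Copy a contiguous [ne1][ne0] block of floats into a (possibly strided) view.
static void cpy_into_view(const float *src, struct view2d *dst) {
    for (int64_t i1 = 0; i1 < dst->ne[1]; i1++) {
        for (int64_t i0 = 0; i0 < dst->ne[0]; i0++) {
            char *p = (char *) dst->data + i1*dst->nb[1] + i0*dst->nb[0];
            memcpy(p, &src[i1*dst->ne[0] + i0], sizeof(float));
        }
    }
}
```

Storing V transposed makes the write target exactly such a strided view, which is why the copy path and the view constructors listed above need to respect strides.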

🤖 Generated by Copilot at 1868f6c

Summary

✨⚡📈

This pull request enhances the ggml library and the llama project by adding new tensor operations, optimizing existing ones, and using RoPE for self-attention. It also improves the debuggability of the llama project by adding optional timing and plotting code. The main files affected are ggml.c, ggml.h, and llama.cpp.

Sing, O Muse, of the mighty ggml, the wondrous library of tensors
That skilled programmers devised with cunning and crafty art
To aid the llama project, the swift and fluent speaker of words
That emulates the GPT-2, the wise and powerful oracle of language

Walkthrough

  • Optimize the self-attention mechanism in llama_eval_internal by using RoPE and view tensors
  • Improve the performance of tensor duplication operations by avoiding unnecessary checks and using memcpy when possible
  • Add a new function ggml_view_3d to create 3-dimensional view tensors from source tensors
  • Comment out alignment checks in ggml_new_tensor_impl to speed up tensor creation
  • Copy byte strides in ggml_view_tensor to preserve the source tensor layout
  • Add optional code for debugging and profiling in llama_eval_internal

Edit:

Merging this since we observe M1 and Windows working well.
If you notice something not right, feel free to revert, but I think it should be good.

@rabidcopy (Contributor) commented Apr 5, 2023

Changes output, but after several runs the uplift in performance seems promising. Best results after running ./main -m ../alpaca-7b-native.bin -b 32 -p "Building a website can be done in 10 simple steps:" -n 100 -t 6 --seed 1 five times for each branch:
Master

Building a website can be done in 10 simple steps:
Identify the purpose of the website and the target audience.
Decide on the structure, design and content for the website.
Acquire the domain name, hosting service, and any other services needed to build the website.
Develop the website using HTML/CSS and JavaScript/jQuery.
Test the website to make sure it works properly.
Publish the website online.
Update and maintain the website regularly.
Monitor the website’s analytics and performance.
Ad
llama_print_timings:        load time =  7551.77 ms
llama_print_timings:      sample time =    65.71 ms /   100 runs   (    0.66 ms per run)
llama_print_timings: prompt eval time =  6902.91 ms /    14 tokens (  493.07 ms per token)
llama_print_timings:        eval time = 18582.74 ms /    99 runs   (  187.70 ms per run)
llama_print_timings:       total time = 26201.43 ms

PR

Building a website can be done in 10 simple steps:
Determine your goals: Before you start building a website, decide what you want it to achieve.
Plan and design: Draw up a plan of the structure and content of your website, and then create attractive visuals that match your branding.
Develop the site: Gather the code and software required to develop the site’s structure and content according to your plan.
Test and troubleshoot: Test out the website thoroughly to make sure it works correctly and fix any
llama_print_timings:        load time =  5587.43 ms
llama_print_timings:      sample time =    65.25 ms /   100 runs   (    0.65 ms per run)
llama_print_timings: prompt eval time =  5030.00 ms /    14 tokens (  359.29 ms per token)
llama_print_timings:        eval time = 15437.18 ms /    99 runs   (  155.93 ms per run)
llama_print_timings:       total time = 21091.10 ms

Bonus result of this PR combined with #768.

Building a website can be done in 10 simple steps:
Determine your goals: Before you start building a website, decide what you want it to achieve.
Plan and design: Draw up a plan of the structure and content of your website, and then create attractive visuals that match your branding.
Develop the site: Gather the code and software required to develop the site’s structure and content according to your plan.
Test and troubleshoot: Test out the website thoroughly to make sure it works correctly and fix any
llama_print_timings:        load time =  2481.76 ms
llama_print_timings:      sample time =    77.38 ms /   100 runs   (    0.77 ms per run)
llama_print_timings: prompt eval time =  1815.66 ms /    14 tokens (  129.69 ms per token)
llama_print_timings:        eval time = 14191.88 ms /    99 runs   (  143.35 ms per run)
llama_print_timings:       total time = 16752.25 ms

So roughly 30 ms per token faster, and 40-45 ms faster with #768 on top. Strangely enough, though, not everyone who tested that PR got the same uplifts I did. It would probably be worth checking perplexity.


```c
// KQV = transpose(V) * KQ_soft_max
struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V_trans, KQ_soft_max);
#if 1
```
Collaborator commented:

shall we check n_token == 1 to take the different branch?

@ggerganov (Owner, Author) replied:

I have a few ideas that might fix this without the need to branch here. Will try them first.

@KASR (Contributor) commented Apr 5, 2023

I will test further later on, but I already want to post some intermediate results. The speed is constant throughout the printing, i.e. the performance degradation has been resolved on my PC (Windows/CMake/VS).

Attached you can find the result of ./main -m ./models/7B/ggml-model-q4_0.bin -p "This is a long story about how programming came to be, including an overview of all important programming languages used so far:" -n 2048 -t 18 --temp 0.8 -c 2048 -s -1

The output reached 1399 words (counted by MS Word) and then halted. After some time I just used Ctrl+C to exit. I've tried some other options, but many times the output simply ends with [end of text], prints the timings, and exits 😅

So I will try to reproduce the behavior of the one run where it got stuck. All other runs produced shorter text and exited cleanly.

output_llama_fix_cpy_n_2048.txt

@ggerganov (Owner, Author)

@KASR
You can use the --ignore-eos flag to never get [end of text].
The halting most likely occurred because the context was full, which triggers the context swap procedure. That procedure is quite heavy (especially if you are not using BLAS), so you have to wait a few seconds for it to continue.
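
For example, the run from the earlier comment would simply gain the extra flag: ./main -m ./models/7B/ggml-model-q4_0.bin -p "This is a long story about how programming came to be, including an overview of all important programming languages used so far:" -n 2048 -t 18 --temp 0.8 -c 2048 -s -1 --ignore-eos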

@KASR (Contributor) commented Apr 5, 2023

> @KASR You can use the --ignore-eos flag to never get [end of text]. The halting most likely occurred because the context was full, which triggers the context swap procedure. That procedure is quite heavy (especially if you are not using BLAS), so you have to wait a few seconds for it to continue.

Yes indeed 😅

The output preserves the performance. I reduced n a bit to account for the prompt and avoid the context swap (so that it's not reflected in the total time). But even when increasing n above n_ctx, the printing itself preserves the speed (after waiting a bit for the context swap).

Timings for one of the long runs:

llama_print_timings:        load time =  1106.80 ms
llama_print_timings:      sample time =   773.61 ms /  1950 runs   (    0.40 ms per run)
llama_print_timings: prompt eval time =  1823.19 ms /    26 tokens (   70.12 ms per token)
llama_print_timings:        eval time = 236098.45 ms /  1949 runs   (  121.14 ms per run)
llama_print_timings:       total time = 239239.01 ms

I also did a quick test with various thread counts; the output remains consistent. See attachment: output_llama_fix_cpy_various_threads.txt

So it appears the suggested changes resolve #603 (at least on my PC/configuration) and fix the issues that #439 tackled.

@ggerganov merged commit 986b6ce into master on Apr 5, 2023
@ggerganov deleted the fix-cpy branch on Apr 5, 2023 at 19:07
@MillionthOdin16 commented Apr 5, 2023

Thanks Georgi! Awesome work! And thanks to everyone else who helped narrow down and evaluate performance along the way! Was neat to see everybody come together to help figure this out ❤️

Dang, that GitHub AI assisted PR 🔥🔥😂

Labels: high priority (Very important issue)

Successfully merging this pull request may close these issues.

Performance Discrepancy: gpt4all Faster than Optimized llama.cpp
5 participants