Avoid heavy V transpose operation + improvements #775

Merged: 1 commit into master from fix-cpy on Apr 5, 2023

Conversation

@ggerganov (Owner) commented Apr 5, 2023

Human-generated notes

I believe this should resolve: #603 #677 #767

When I deprecated the non-contiguous ggml_mul_mat() branch in #439, I grossly underestimated the cost of transposing the V matrix on every token. It's a very heavy memory operation, with tons of cache misses.

To solve this, we now store V in the KV cache in a transposed state, so we don't need to transpose it again for future tokens.
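
To see why this matters, here is a minimal, self-contained C sketch of the two strategies (an illustration only, not the actual llama.cpp code; the sizes and names are assumptions): the old approach rebuilds a transposed copy of the entire V cache on every token, while the new approach writes each token's values straight into a transposed layout once.

```c
// Illustration only (not llama.cpp code): why storing V transposed in the
// cache avoids a heavy per-token transpose. N_CTX / N_EMBD are assumed sizes.
#include <string.h>

#define N_CTX  512   // max cached tokens (assumed)
#define N_EMBD 4096  // embedding size (assumed)

// Old approach: V stored row-major as [token][embd]; every evaluation rebuilds
// V^T over all cached tokens -> O(n_past * n_embd) cache-unfriendly reads.
static void append_then_transpose(float v_cache[N_CTX][N_EMBD],
                                  float v_trans[N_EMBD][N_CTX],
                                  const float *v_cur, int n_past) {
    memcpy(v_cache[n_past], v_cur, N_EMBD * sizeof(float));
    for (int e = 0; e < N_EMBD; e++) {        // full transpose, every token
        for (int t = 0; t <= n_past; t++) {
            v_trans[e][t] = v_cache[t][e];
        }
    }
}

// New approach: keep the cache already transposed as [embd][token]; each new
// token is scattered into its column once, and attention reads v_trans as-is.
static void append_transposed(float v_trans[N_EMBD][N_CTX],
                              const float *v_cur, int n_past) {
    for (int e = 0; e < N_EMBD; e++) {
        v_trans[e][n_past] = v_cur[e];        // one strided write per token
    }
}
```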

ggml :

  • added ggml_view_3d() (see the view/stride sketch after these lists)
  • ggml_view_tensor() now inherits the stride too
  • reimplement ggml_cpy() to account for dst stride
  • no longer require tensor->data to be memory aligned

llama :

  • compute RoPE on 32-bit tensors (should be more accurate)
  • store RoPE-ed K in the KV cache
  • store transposed V in the KV cache (significant speed-up)
  • avoid unnecessary Q copy
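
The ggml-side items revolve around views: tensors that share the KV cache buffer and are described by element counts (ne) and byte strides (nb) rather than owning contiguous memory. A rough, self-contained sketch of the idea follows (an illustrative struct and helper under assumed names, not ggml's actual implementation); it shows why a view has to carry the source's strides and why a copy into a view must step by the destination's strides instead of assuming a contiguous layout.

```c
// Illustration only (assumed names, not ggml's real structs): a 2-D strided
// view over a shared buffer, and a copy that honors the destination strides.
#include <stdint.h>
#include <string.h>

struct view2d {
    float  *data;   // shared with the source tensor's buffer
    int64_t ne[2];  // number of elements in each dimension
    size_t  nb[2];  // byte stride to advance one element in each dimension
};

// Copy a contiguous [ne1][ne0] block of floats into a (possibly strided) view.
static void cpy_into_view(const float *src, struct view2d *dst) {
    for (int64_t i1 = 0; i1 < dst->ne[1]; i1++) {
        for (int64_t i0 = 0; i0 < dst->ne[0]; i0++) {
            char *p = (char *) dst->data + i1*dst->nb[1] + i0*dst->nb[0];
            memcpy(p, &src[i1*dst->ne[0] + i0], sizeof(float));
        }
    }
}
```

Storing V transposed makes the write target exactly such a strided view, which is why the copy path and the view constructors listed above need to respect strides.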

🤖 Generated by Copilot at 1868f6c

Summary

✨⚡📈

This pull request enhances the ggml library and the llama project by adding new tensor operations, optimizing existing ones, and using RoPE for self-attention. It also improves the debuggability of the llama project by adding optional timing and plotting code. The main files affected are ggml.c, ggml.h, and llama.cpp.

Sing, O Muse, of the mighty ggml, the wondrous library of tensors
That skilled programmers devised with cunning and crafty art
To aid the llama project, the swift and fluent speaker of words
That emulates the GPT-2, the wise and powerful oracle of language

Walkthrough

  • Optimize the self-attention mechanism in llama_eval_internal by using RoPE and view tensors
  • Improve the performance of tensor duplication operations by avoiding unnecessary checks and using memcpy when possible
  • Add a new function ggml_view_3d to create 3-dimensional view tensors from source tensors
  • Comment out alignment checks in ggml_new_tensor_impl to speed up tensor creation
  • Copy byte strides in ggml_view_tensor to preserve the source tensor layout
  • Add optional code for debugging and profiling in llama_eval_internal

Edit:

Merging this since we observe M1 and Windows working well.
If you notice something not right, feel free to revert, but I think it should be good.

@rabidcopy (Contributor) commented Apr 5, 2023

Changes output, but after several runs the uplift in performance seems promising. Best results after running ./main -m ../alpaca-7b-native.bin -b 32 -p "Building a website can be done in 10 simple steps:" -n 100 -t 6 --seed 1 five times for each branch:
Master

Building a website can be done in 10 simple steps:
Identify the purpose of the website and the target audience.
Decide on the structure, design and content for the website.
Acquire the domain name, hosting service, and any other services needed to build the website.
Develop the website using HTML/CSS and JavaScript/jQuery.
Test the website to make sure it works properly.
Publish the website online.
Update and maintain the website regularly.
Monitor the website’s analytics and performance.
Ad
llama_print_timings:        load time =  7551.77 ms
llama_print_timings:      sample time =    65.71 ms /   100 runs   (    0.66 ms per run)
llama_print_timings: prompt eval time =  6902.91 ms /    14 tokens (  493.07 ms per token)
llama_print_timings:        eval time = 18582.74 ms /    99 runs   (  187.70 ms per run)
llama_print_timings:       total time = 26201.43 ms

PR

Building a website can be done in 10 simple steps:
Determine your goals: Before you start building a website, decide what you want it to achieve.
Plan and design: Draw up a plan of the structure and content of your website, and then create attractive visuals that match your branding.
Develop the site: Gather the code and software required to develop the site’s structure and content according to your plan.
Test and troubleshoot: Test out the website thoroughly to make sure it works correctly and fix any
llama_print_timings:        load time =  5587.43 ms
llama_print_timings:      sample time =    65.25 ms /   100 runs   (    0.65 ms per run)
llama_print_timings: prompt eval time =  5030.00 ms /    14 tokens (  359.29 ms per token)
llama_print_timings:        eval time = 15437.18 ms /    99 runs   (  155.93 ms per run)
llama_print_timings:       total time = 21091.10 ms

Bonus result of this PR combined with #768.

Building a website can be done in 10 simple steps:
Determine your goals: Before you start building a website, decide what you want it to achieve.
Plan and design: Draw up a plan of the structure and content of your website, and then create attractive visuals that match your branding.
Develop the site: Gather the code and software required to develop the site’s structure and content according to your plan.
Test and troubleshoot: Test out the website thoroughly to make sure it works correctly and fix any
llama_print_timings:        load time =  2481.76 ms
llama_print_timings:      sample time =    77.38 ms /   100 runs   (    0.77 ms per run)
llama_print_timings: prompt eval time =  1815.66 ms /    14 tokens (  129.69 ms per token)
llama_print_timings:        eval time = 14191.88 ms /    99 runs   (  143.35 ms per run)
llama_print_timings:       total time = 16752.25 ms

So roughly 30 ms per token faster, and 40-45 ms faster with #768 on top. Strangely enough, though, not everyone who tested that PR got the same uplifts I did. It would probably be worth checking perplexity.


```c
// KQV = transpose(V) * KQ_soft_max
struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V_trans, KQ_soft_max);
#if 1
```
Collaborator commented:

shall we check n_token == 1 to take the different branch?

@ggerganov (Owner, Author) replied:

I have a few ideas that might fix this without the need to branch here. Will try them first.

@KASR (Contributor) commented Apr 5, 2023

I will test further later on, but I already want to post some intermediate results. The speed is constant throughout the printing, i.e. the performance degradation has been resolved on my PC (Windows/CMake/VS).

Attached you can find the result of ./main -m ./models/7B/ggml-model-q4_0.bin -p "This is a long story about how programming came to be, including an overview of all important programming languages used so far:" -n 2048 -t 18 --temp 0.8 -c 2048 -s -1

The output reached 1399 words (counted by MS Word) and then halted. After some time I just used Ctrl+C to exit. I've tried some other options, but many times the output simply ends with [end of text], prints the timings, and exits 😅

So I will try to reproduce the behavior of the one run where it got stuck. All other runs produced shorter text and exited cleanly.

output_llama_fix_cpy_n_2048.txt

@ggerganov (Owner, Author)

@KASR
You can use the --ignore-eos flag to never get [end of text].
The halting most likely occurred because the context was full, which triggers the context swap procedure. That procedure is quite heavy (especially if you are not using BLAS), so you have to wait a few seconds for it to continue.
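
For example, the run from the earlier comment would simply gain the extra flag: ./main -m ./models/7B/ggml-model-q4_0.bin -p "This is a long story about how programming came to be, including an overview of all important programming languages used so far:" -n 2048 -t 18 --temp 0.8 -c 2048 -s -1 --ignore-eos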

@KASR (Contributor) commented Apr 5, 2023

> @KASR You can use the --ignore-eos flag to never get [end of text]. The halting most likely occurred because the context was full, which triggers the context swap procedure. That procedure is quite heavy (especially if you are not using BLAS), so you have to wait a few seconds for it to continue.

Yes indeed 😅

The output preserves the performance. I reduced n a bit to account for the prompt and avoid the context swap (so that it's not reflected in the total time). But even when increasing n above n_ctx, the printing itself preserves the speed (after waiting a bit for the context swap).

Timings for one of the long runs:

llama_print_timings:        load time =  1106.80 ms
llama_print_timings:      sample time =   773.61 ms /  1950 runs   (    0.40 ms per run)
llama_print_timings: prompt eval time =  1823.19 ms /    26 tokens (   70.12 ms per token)
llama_print_timings:        eval time = 236098.45 ms /  1949 runs   (  121.14 ms per run)
llama_print_timings:       total time = 239239.01 ms

I also did a quick test with various thread counts; the output remains consistent. See attachment: output_llama_fix_cpy_various_threads.txt

So it appears the suggested changes resolve #603 (at least on my PC/configuration) and fix the issues that #439 tackled.

@ggerganov merged commit 986b6ce into master on Apr 5, 2023
@ggerganov deleted the fix-cpy branch on Apr 5, 2023 at 19:07
@MillionthOdin16 commented Apr 5, 2023

Thanks Georgi! Awesome work! And thanks to everyone else who helped narrow down and evaluate performance along the way! Was neat to see everybody come together to help figure this out ❤️

Dang, that GitHub AI assisted PR 🔥🔥😂

Labels: high priority (Very important issue)

Successfully merging this pull request may close these issues.

Performance Discrepancy: gpt4all Faster than Optimized llama.cpp
5 participants