llama: implement YaRN RoPE scaling #2268

Merged
merged 36 commits on Nov 1, 2023

Conversation

cebtenzzre
Collaborator

@cebtenzzre cebtenzzre commented Jul 18, 2023

This is an implementation of YaRN RoPE scaling. See https://github.com/jquesnelle/yarn and the paper and errata.

TODO:

  • Add new GGUF key for how much context the base model was trained on
  • Support converting the new models to GGUF
  • Add backward implementations
  • Test new LLaMA implementation
  • Finish and test Falcon implementation

@cebtenzzre cebtenzzre force-pushed the ntkv2 branch 3 times, most recently from ce59171 to f3b9eae Compare July 19, 2023 03:55
@cebtenzzre cebtenzzre changed the title from "llama: implement NTK-By-Parts (NTKv2)" to "llama: implement NTK-By-Parts (NTKv2) RoPE scaling" Jul 19, 2023
@FNsi
Contributor

FNsi commented Jul 20, 2023

Any guide for setting the extrapolation and ntk parameters? How do they work with the previous two parameters?

@cebtenzzre
Collaborator Author

The upstream NTKv2 doesn't use --rope-freq-base, so it probably doesn't make sense to use it. It does use --rope-freq-scale, which works like linear scaling and is supposed to be calibrated so that e.g. a .25 scale actually gives you 8192 context. To use the default NTKv2, set --rope-ntk-factor and --rope-extrapolation-factor to 1, and set --rope-freq-scale appropriately. The lower the factors are, the less the respective scaling methods are mixed in, although I believe the graphs were generated with both at 100% - the code automatically ramps them based on some experimentally determined thresholds.
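
For illustration only, here is a rough sketch of the mixing idea described above - not the code in this PR, and the ramp boundaries and per-dimension assignment below are hypothetical. Each factor gates how much of its correction is blended into plain linear (interpolated) scaling for each rotary dimension:

```c
// Rough sketch of the factor-mixing idea only -- NOT this PR's implementation.
// The ramp boundaries and the per-dimension assignment below are hypothetical.
#include <math.h>

// 0 below lo, 1 above hi, linear in between
static float ramp(float lo, float hi, float x) {
    float y = (x - lo) / (hi - lo);
    return y < 0.0f ? 0.0f : (y > 1.0f ? 1.0f : y);
}

// rotation angle for one rotary pair (i0 = 0, 2, 4, ...) at position pos
static float mixed_theta(int pos, int i0, int n_dims, float freq_base, float freq_scale,
                         float ext_factor, float ntk_factor) {
    float theta_extrap = pos * powf(freq_base, -(float)i0 / n_dims); // unscaled (extrapolation)
    float theta_interp = freq_scale * theta_extrap;                  // plain linear scaling

    // NTK-aware variant: stretch the base instead of the positions
    float ntk_base  = freq_base * powf(1.0f / freq_scale, (float)n_dims / (n_dims - 2));
    float theta_ntk = pos * powf(ntk_base, -(float)i0 / n_dims);

    // hypothetical ramp boundaries; the real code derives them from
    // experimentally determined thresholds
    float r = ramp(0.25f * n_dims, 0.75f * n_dims, (float)i0);

    // each factor gates how much of its correction is mixed in;
    // with ext_factor = ntk_factor = 0 this reduces to plain linear scaling
    float mix_ext = ext_factor * (1.0f - r);
    float mix_ntk = ntk_factor * r;
    return theta_interp * (1.0f - mix_ext - mix_ntk)
         + theta_extrap * mix_ext
         + theta_ntk    * mix_ntk;
}
```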

@cebtenzzre cebtenzzre marked this pull request as ready for review July 21, 2023 22:04
@cebtenzzre
Collaborator Author

I would appreciate help with the following:

  • Should I try to write a backwards implementation? NTKv1 still doesn't have one, so I don't have much to base it on.
  • I don't have a Mac to test the Metal code on. If anyone sees obvious flaws or can test it locally, let me know.
  • I'm going to try to run a perplexity benchmark against NTKv1 and linear scaling, but I don't know if my current hardware is up to the task.

Owner

@ggerganov ggerganov left a comment


Rename extrapolation_factor to ext_factor everywhere.

ggml.c (review comment: outdated, resolved)
@ggerganov
Owner

No need for a backwards implementation for now.

@cebtenzzre

This comment was marked as outdated.

@cebtenzzre
Collaborator Author

Perplexity with NTKv2 may be worse because neither implementation here is the dynamic version, which AFAIK works better on non-finetuned models. But fine-tuned models are far superior anyway.

NTKv1 does not converge when fine-tuning, which is why NTKv2 exists. So until somebody publishes a model fine-tuned with NTKv2—maybe LLongMAv2 will be released after jquesnelle publishes the paper based on scaled-rope—the existing LLongMA, which uses regular linear interpolation (just like SuperHOT), is the state-of-the-art for long contexts.

@cebtenzzre
Collaborator Author

cebtenzzre commented Aug 31, 2023

The paper has been released. The resulting method is called YaRN. Apparently the models that use this technique are good to about 120k tokens of context.

More work will definitely be needed to use these models with llama.cpp.

@cebtenzzre cebtenzzre changed the title from "llama: implement NTK-By-Parts (NTKv2) RoPE scaling" to "llama: implement YaRN RoPE scaling" Sep 5, 2023
@cebtenzzre

This comment was marked as resolved.

@bloc97

bloc97 commented Sep 6, 2023

Thank you for the llamacpp implementation of YaRN!

I'm just letting you know that

constant float max_pos_emb = 2048;

should be changed to 4096 for llama 2 models when using YaRN (the default was 2048 because we did most of our tests with llama 1 models).
This value should probably be saved inside the model config and loaded at inference time...
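
To make it concrete where that value enters the math, here is a minimal sketch (helper name is mine, not necessarily what this PR uses): in YaRN, the original training context determines the band of rotary dimensions that gets interpolated rather than extrapolated.

```c
// Sketch of where the original training context (max_pos_emb / original ctx) enters YaRN.
// The correction dimension is the rotary dim whose wavelength completes n_rot full
// rotations over n_orig_ctx positions; the band between the fast/slow thresholds is ramped.
#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

static float yarn_corr_dim(int n_dims, int n_orig_ctx, float n_rot, float freq_base) {
    return n_dims * logf(n_orig_ctx / (n_rot * 2.0f * (float)M_PI)) / (2.0f * logf(freq_base));
}

// With the commonly used thresholds beta_fast = 32 and beta_slow = 1, the band is
//   [yarn_corr_dim(n_dims, n_orig_ctx, 32.0f, 10000.0f),
//    yarn_corr_dim(n_dims, n_orig_ctx,  1.0f, 10000.0f)]
// so switching n_orig_ctx from 2048 (llama 1) to 4096 (llama 2) shifts which dims get scaled.
```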

@cebtenzzre
Collaborator Author

should be changed to 4096 for llama 2 models

Thanks for reminding me. I originally made this PR before GGUF was finished, so I hardcoded it in the meantime. I believe I can now use the value of llama.context_length for this purpose.

@KerfuffleV2
Collaborator

Would it be worth testing this with non-YaRN fine-tuned models? If so, any suggested settings? I can test it with ROCm.

@Green-Sky
Collaborator

Green-Sky commented Sep 6, 2023

Thank you for the llamacpp implementation of YaRN!

I'm just letting you know that

constant float max_pos_emb = 2048;

should be changed to 4096 for llama 2 models when using YaRN (the default was 2048 because we did most of our tests with llama 1 models). This value should probably be saved inside the model config and loaded at inference time...

this needs to be a new GGUF kv, something like "rope_yarn_orig_ctx"

Thanks for reminding me. I originally made this PR before GGUF was finished, so I hardcoded it in the meantime. I believe I can now use the value of llama.context_length for this purpose.

llama.context_length should be the size of the finetune, e.g. 128Ki.

@cebtenzzre cebtenzzre marked this pull request as draft September 6, 2023 15:50
@bloc97

bloc97 commented Sep 6, 2023

this needs to be a new GGUF kv, something like "rope_yarn_orig_ctx"

Exactly. After finetuning a model with YaRN, we have to keep track of two values: the original context length (2048 for LLaMA or 4096 for Llama 2) and the final context length (which can be calculated by multiplying the original ctx length by the scale factor, e.g. 4096 x 32 = 128Ki).

In this case, the `constant float max_pos_emb = 2048;` used in the equations must be equal to the original context size, not the final context size.
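
To spell out that bookkeeping, a tiny sketch (field names are illustrative only, not an actual GGUF schema):

```c
// The two context values a YaRN finetune has to carry around (illustrative names only).
struct yarn_ctx {
    int   n_orig_ctx;   // original pretraining context: 2048 (LLaMA) or 4096 (Llama 2)
    float scale_factor; // YaRN scale factor, e.g. 32
};

// final (finetuned) context length, e.g. 4096 * 32 = 131072 (128Ki)
static int yarn_final_ctx(const struct yarn_ctx *c) {
    return (int)(c->n_orig_ctx * c->scale_factor);
}

// The RoPE freq_scale that goes with it is 1 / scale_factor (e.g. 1/32 = 0.03125),
// while max_pos_emb in the equations stays at n_orig_ctx.
```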

@cebtenzzre
Collaborator Author

cebtenzzre commented Nov 2, 2023

If ext_factor would never go negative,

I'd be fine with that solution. Would you like to make a PR?

edit: For some reason, I can't reproduce this on Linux with clang or gcc, or on an M2 Mac, at least on CPU.

edit 2: I can't build llama.cpp with Metal on my Mac:

c++ -I. -Icommon -D_XOPEN_SOURCE=600 -D_DARWIN_C_SOURCE -DNDEBUG -DGGML_USE_ACCELERATE -DACCELERATE_NEW_LAPACK -DACCELERATE_LAPACK_ILP64 -DGGML_USE_METAL  -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wmissing-declarations -Wmissing-noreturn -pthread  -Ofast -Wunreachable-code-break -Wunreachable-code-return -Wmissing-prototypes -Wextra-semi  examples/main/main.cpp ggml.o llama.o common.o sampling.o grammar-parser.o build-info.o console.o ggml-metal.o ggml-alloc.o ggml-backend.o ggml-quants.o -o main -framework Accelerate -framework Foundation -framework Metal -framework MetalKit 
0  0x102f3b648  __assert_rtn + 72
1  0x102e63c5c  ld::Fixup::applyFixup(ld::Atom const*, ld::LayoutLinkedImage const&, unsigned char*) const + 8268
2  0x102ef67d8  ___ZN2ld16LayoutExecutable27writeContentWithoutLinkEditENSt3__14spanIhLm18446744073709551615EEEy_block_invoke + 332
3  0x102ef6a14  void mapReduce<ld::Atom const*, mach_o::Error>(std::__1::span<ld::Atom const*, 18446744073709551615ul>, unsigned long, void (unsigned long, mach_o::Error&, std::__1::span<ld::Atom const*, 18446744073709551615ul>) block_pointer, void (std::__1::span<mach_o::Error, 18446744073709551615ul>) block_pointer) + 384
4  0x102ef6594  ld::LayoutExecutable::writeContentWithoutLinkEdit(std::__1::span<unsigned char, 18446744073709551615ul>, unsigned long long) + 1180
5  0x102efc020  ld::LayoutExecutable::writeToFile(char const*) + 15248
6  0x102eae2e8  main + 9424
ld: Assertion failed: (extras.otherInstrOffset != 0 && "Kind::arm64_adrp_ldr missing extra info"), function applyFixup, file Fixup.cpp, line 793.
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [main] Error 1

Seems like a bug in the Xcode-provided clang 15?

@KerfuffleV2
Collaborator

#2268 (comment) - this seems to fix my problem. Really weird that it only has an effect when offloading that last non-repeating layer.

@jxy
Contributor

jxy commented Nov 3, 2023

@cebtenzzre thanks for pushing the PR.

Now I'm testing this https://huggingface.co/TheBloke/Yarn-Mistral-7B-64k-GGUF and I'm getting

$ ./perplexity -t 1 -ngl 1 -m models/yarn-mistral-7b-64k.Q8_0.gguf -c 512 -f ../wikitext-2-raw/wiki.test.raw 2>/dev/null
[1]24.7243,[2]31.1885,[3]36.5431,[4]41.0809,^C

so something must be wrong, as the base model has

$ ./perplexity -t 1 -ngl 1 -m models/mistral-7b-v0.1.Q8_0.gguf -c 512 -f ../wikitext-2-raw/wiki.test.raw 2>/dev/null   
[1]3.9958,[2]4.4960,[3]5.2987,[4]5.9971,^C

The GGUF is recognized correctly

llm_load_print_meta: rope scaling     = yarn
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 0.125
llm_load_print_meta: n_yarn_orig_ctx  = 8192
llm_load_print_meta: rope_finetuned   = yes

and

llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.125
llama_new_context_with_model: kv self size  =   64.00 MB

@jxy
Contributor

jxy commented Nov 3, 2023

The Metal issue is a simple fix: #3937

@FNsi
Contributor

FNsi commented Nov 4, 2023

Found that Mistral 7B YaRN 128k has been released:

mistral 7b yarn

(Meanwhile, it seems 320G of VRAM is needed for 128k ctx)

@ggerganov
Owner

(Meanwhile seems 320G vram needed for 128k ctx)

More like 16GB. Where do you get this number from?
At least for vanilla Mistral with 8 KV heads it's 1GB per 8k context.
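
For reference, a quick back-of-the-envelope check (assuming an f16 KV cache and the published Mistral 7B dimensions: 32 layers, 8 KV heads, head dim 128, and no sliding-window savings):

```c
// Rough KV cache sizing for vanilla Mistral 7B with an f16 cache.
#include <stdio.h>

int main(void) {
    const double n_layer = 32, n_kv_head = 8, head_dim = 128, bytes_per_elem = 2; // f16
    const double per_token = 2 * n_layer * n_kv_head * head_dim * bytes_per_elem; // K and V
    printf("KV cache per token: %.0f KiB\n", per_token / 1024.0);                      // 128 KiB
    printf("8k context:   %.2f GiB\n", per_token * 8192.0   / (1024.0 * 1024 * 1024)); // ~1 GiB
    printf("128k context: %.2f GiB\n", per_token * 131072.0 / (1024.0 * 1024 * 1024)); // ~16 GiB
    return 0;
}
```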

@FNsi
Contributor

FNsi commented Nov 4, 2023

(Meanwhile seems 320G vram needed for 128k ctx)

More like 16GB. Where do you get this number from?

At least for vanilla Mistral with 8 KV heads it's 1GB per 8k context.

According to @bloc97 in the model discussion, and he's one of the model's authors if I'm correct.

@Green-Sky
Collaborator

The big number discrepancy probably stems from us not properly implementing Mistral's context window shenanigans.

@ggerganov
Owner

I could be missing something, but if we implemented the Mistral SWA thing, we would require even less memory

@Dampfinchen

Dampfinchen commented Nov 6, 2023

The big number discrepancy probably stems from us not properly implementing Mistral's context window shenanigans.

Yes, also I've heard Mistral relies heavily on Sliding Window Attention even for 4K context.

So for best performance, it really should be implemented.

@KerfuffleV2
Collaborator

So for best performance, it really should be implemented.

If you mean the per-layer stuff, the information to implement it really doesn't exist, and their code examples don't include that. Also, they didn't respond to issues in their repo asking for clarification, so...

@bloc97

bloc97 commented Nov 6, 2023

The big number discrepancy probably stems from us not properly implementing Mistral's context window shenanigans.

Hugging Face and PyTorch modeling code is much less VRAM-efficient than llama.cpp because it has to take into account both training and inference use cases (e.g. arbitrarily shaped attention masking) and expose internal values to allow PEFT training. In these scenarios, the KV cache is extremely inefficient and the models' internal states are also kept, making inference use a huge amount of VRAM. It is possible to rewrite the Llama and Mistral inference code with custom kernels in PyTorch, but it would break compatibility with all other features (e.g. what is done by Exllama or vLLM).

xaedes added a commit to xaedes/llama.cpp that referenced this pull request Nov 6, 2023
rope backward process was broken after YaRN RoPE (ggerganov#2268) implementation, due to missing changes in backward functions.

the code for the backward process is nearly identical to the forward process:
the only difference is the sign of the sin-values.

to avoid future regressions remove the near-duplicate backward functions and reuse the forward code:

for this a new function argument `bool forward` was added to `ggml_compute_forward_rope_f32` and `ggml_compute_forward_rope_f16`.
the sin-values will be negated when forward is false.
ggerganov pushed a commit that referenced this pull request Nov 7, 2023
* fix backward process of rope

rope backward process was broken after YaRN RoPE (#2268) implementation, due to missing changes in backward functions.

the code for the backward process is nearly identical to the forward process:
the only difference is the sign of the sin-values.

to avoid future regressions remove the near-duplicate backward functions and reuse the forward code:

for this a new function argument `bool forward` was added to `ggml_compute_forward_rope_f32` and `ggml_compute_forward_rope_f16`.
the sin-values will be negated when forward is false.

* fix finetune rope call to use correct default attn_factor of 1.0f

* remove unused `ggml_rope_xpos_back`

it is better to have only one `ggml_rope_back` function that accepts all rope parameters, so that `ggml_compute_backward` can propagate all parameters without having to switch between different rope_back variants.

* fix comments explaining the sinus sign in ggml_forward_rope

* add missing function arguments in declaration

* fix function argument type in declaration
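
To illustrate the idea in the commit above with a minimal sketch (not the actual ggml code): the backward pass of RoPE is just a rotation by -theta, so one routine with a `bool forward` flag can serve both directions by flipping only the sign of the sin term.

```c
// Minimal sketch of sharing forward/backward RoPE code via a `bool forward` flag.
#include <math.h>
#include <stdbool.h>

static void rope_pair(float x0, float x1, float theta, bool forward,
                      float *y0, float *y1) {
    const float cos_theta = cosf(theta);
    const float sin_theta = forward ? sinf(theta) : -sinf(theta); // backward = rotate by -theta
    *y0 = x0 * cos_theta - x1 * sin_theta;
    *y1 = x0 * sin_theta + x1 * cos_theta;
}
```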
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Jeffrey Quesnelle <jquesnelle@gmail.com>
cebtenzzre added a commit to nomic-ai/llama.cpp that referenced this pull request Nov 23, 2023
The NeoX cur_rot part is different because I'm pretty sure my original
implementation was wrong.
@cebtenzzre cebtenzzre removed the "demo" label (Demonstrate some concept or idea, not intended to be merged) Feb 13, 2024