
Implement customizable RoPE #2054

Merged: 10 commits merged into ggerganov:master from custom_rope on Jul 15, 2023

Conversation

@jxy (Contributor) commented Jun 30, 2023

The original RoPE has pre-defined parameters

theta_i = 10000^(−2(i−1)/d), for i in [1, 2, ..., d/2]

Our customizable RoPE, ggml_rope_custom_inplace, uses

theta_i = scale * base^(−2(i−1)/d), for i in [1, 2, ..., d/2]

with defaults that match the original:

scale = 1.0
base = 10000

The new command line arguments
--rope-freq-base
--rope-freq-scale
set the two new RoPE parameters.
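
As a quick illustration of the formula above (a standalone C++ sketch, not the ggml implementation; the helper name rope_freqs is made up for this example):

#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative only: theta_i = scale * base^(-2(i-1)/d) for i = 1..d/2.
// With scale = 1.0 and base = 10000 this reproduces the original RoPE.
static std::vector<float> rope_freqs(int d, float base, float scale) {
    std::vector<float> theta(d / 2);
    for (int i = 1; i <= d / 2; ++i) {
        theta[i - 1] = scale * std::pow(base, -2.0f * (i - 1) / d);
    }
    return theta;
}

int main() {
    const auto def = rope_freqs(128, 10000.0f, 1.0f);  // original RoPE
    const auto cus = rope_freqs(128, 80000.0f, 0.5f);  // e.g. --rope-freq-base 80000 --rope-freq-scale 0.5
    std::printf("theta_1: %g -> %g, theta_64: %g -> %g\n", def[0], cus[0], def[63], cus[63]);
}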

Recent research shows that changing these two parameters extends the context limit with minimal loss.

  1. Extending Context to 8K kaiokendev https://kaiokendev.github.io/til#extending-context-to-8k

  2. Extending Context Window of Large Language Models via Positional Interpolation Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian https://arxiv.org/abs/2306.15595

  3. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/user/bloc97 https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

For the bold, try adding the following command line parameters to your favorite model: -c 16384 --rope-freq-base 80000 --rope-freq-scale 0.5

@1980Dragon

Just came across another RoPE adjustment method on Reddit. Thought it might be helpful, so here's the link!
Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning


@maddes8cht (Contributor)

This still means I will get better perplexity performance when using this PR with @TheBloke's SuperHOT model variants, I guess?

@FNsi (Contributor) commented Jun 30, 2023

Somehow exciting?

I use my merged 13b model (wizard vicuña + starcoder + superhot 16k)

With that 16k command.

[1] 5.3710, [2] 5.7513

Looks reasonable.

And what if I want to test 32k or higher? How should I set both parameters? Any ideas?

@jxy (Contributor, Author) commented Jun 30, 2023

The --rope-freq-scale is the same scale used in "superhot/rope interpolation". The SuperHOT 8k LoRA corresponds to --rope-freq-scale 0.25 -c 8192, which is a factor of 4 increase. Similarly, the SuperHOT 16k LoRA and LongChat 16k correspond to --rope-freq-scale 0.125 -c 16384, a factor of 8.
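
In other words, assuming the usual 2048-token trained context, the linear-interpolation setting works out to

    rope-freq-scale = trained context / target context

for example 2048 / 8192 = 0.25 for 8k, and 2048 / 16384 = 0.125 for 16k.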

The --rope-freq-base simplifies the "NTK-Aware Scaled RoPE". The base number here corresponds to 10000*alpha**(64/63), using the alpha introduced in the reddit post. I'm not aware of any direct translation from context length to base or alpha. My limited testing with 13B models shows a rough quadratic correspondence, C = -0.0652*b*b + 0.862*b + 0.203, where C is the factor of context length increase and b the factor of base increase, roughly:

base    effective ctx factor    effective ctx
20000   1.66                    3400
26000   2.0                     4096
40000   2.6                     5300
57200   3.0                     6144

I found base>60000 didn't feel good, though I've no hard numbers to back this up.
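
For reference, a small C++ sketch of the two relations above (the alpha-to-base conversion and the rough quadratic fit; the coefficients are only an informal estimate from the 13B tests, not a guarantee):

#include <cmath>
#include <cstdio>

int main() {
    // NTK-aware alpha (from the reddit post) -> --rope-freq-base
    const float alpha = 4.0f;                                     // example value
    const float base  = 10000.0f * std::pow(alpha, 64.0f / 63.0f); // ~40900

    // rough quadratic fit from above: C = -0.0652*b^2 + 0.862*b + 0.203,
    // with b the factor of base increase and C the factor of context increase
    const float b = base / 10000.0f;
    const float C = -0.0652f*b*b + 0.862f*b + 0.203f;
    std::printf("base ~ %.0f, effective ctx ~ %.0f tokens\n", base, C * 2048.0f);
}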

Empirically, without fine-tuning, you could try

  • -c 4096 --rope-freq-scale 0.83 --rope-freq-base 20000
  • -c 6144 --rope-freq-scale 0.86 --rope-freq-base 40000
  • -c 8192 --rope-freq-scale 0.75 --rope-freq-base 57200

With superhot 16k or longchat 13b, perhaps you could try (KV cache alone requires 25GB!!)

  • -c 32768 --rope-freq-scale 0.125 --rope-freq-base 26000
  • or dial up base more?

@jxy (Contributor, Author) commented Jun 30, 2023

I used some numbers posted by @JohannesGaessler and made changes to the scratch0 size in this PR. I can rebase this PR on their PR #2056 if needed.

@JohannesGaessler (Collaborator)

I think the numbers that I determined for the VRAM scratch buffer will probably work for the RAM scratch buffer but I would still advise you to be cautious since the two types of scratch buffer work differently (the VRAM scratch buffer has tighter limits).

@trap20 commented Jun 30, 2023

I tried the -c 4096 --rope-freq-scale 0.83 --rope-freq-base 20000 configuration with the wizardlm-33b-v1.0-uncensored.ggmlv3.q5_K_M.bin model and got ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 490733568, available 482344960). Even with 0 layers offloaded.

guanaco-65B.ggmlv3.q4_K_M.bin works with those settings.

@jxy (Contributor, Author) commented Jun 30, 2023

I tried the -c 4096 --rope-freq-scale 0.83 --rope-freq-base 20000 configuration with the wizardlm-33b-v1.0-uncensored.ggmlv3.q5_K_M.bin model and got ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 490733568, available 482344960).

I can't reproduce. Is the CUDA build different? I can't test a CUDA build. The number 482344960 seems to be computed from

{ MODEL_30B, ((size_t) n_ctx / 10ull + 256ull) * MB }

with n_ctx = 2048. Did you use main or some other method/executable to run it?
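
For reference, that arithmetic works out exactly (assuming MB is 1024 * 1024 here): 2048 / 10 + 256 = 460, and 460 * 1048576 = 482344960 bytes, which matches the "available" figure in the error.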

@time-less-ness commented Jun 30, 2023

I am trying this out, and it is working fine for me so far, though I've only tried:

model: gpt4-alpaca-lora-30b.ggmlv3.q4_0.bin
context: -c 6144 --rope-freq-scale 0.86 --rope-freq-base 40000

I'll keep trying other models and context sizes and see how it goes. Not sure if other fixes are included, but this seems to make inference way faster (via fewer long pauses for CPU-related tasks) on my machine as well. Possibly just from not having to recompute the context when hitting the 2048-token mark so often?

EDIT: Also working just fine for me:

model: gpt4-alpaca-lora-30b.ggmlv3.q4_0.bin 
context: -c 8192 --rope-freq-scale 0.75 --rope-freq-base 57200

@trap20 commented Jun 30, 2023

I can't reproduce. Is the CUDA build different? [...] Did you use main or some other method/executable to run it?

Yes, it's a CUDA build for a 1080ti and I really should have used main for reporting, but didn't. I'll try main tomorrow.

@jxy (Contributor, Author) commented Jul 1, 2023

I can't reproduce. Is the CUDA build different? [...] Did you use main or some other method/executable to run it?

Yes, it's a CUDA build for a 1080ti and I really should have used main for reporting, but didn't. I'll try main tomorrow.

If you could give me a stack trace of when MEM_REQ_SCRATCH0 is called, I could try to figure out what is wrong with the CUDA build. Otherwise, I'll see if I can get a system somewhere with CUDA.

@trap20 commented Jul 1, 2023

Can't reproduce the error today, no idea what I did exactly to trigger it...

@ggerganov (Owner)

This looks great, but similar to #1967 - let's wait for a while before merging.
There are new RoPE scaling techniques popping up by the hour, each one better than the last. No reason to commit to something just yet.

@FNsi (Contributor) commented Jul 2, 2023

  • or dial up base more?

Tested with my merged 13B Vicuña model (WizardVicuña + StarCoder LoRA + gpt4tools + 16k SuperHOT).

16k, with perplexity. The chunk count decreases to 20 at 16k:

base  70000 scale 0.4 [1] 5.5564

base  57200 scale 0.5 [1] 6.7699
base  68000 scale 0.5 [1] 5.3758
base  70000 scale 0.5 [1] 5.3508
base  75000 scale 0.5 [1] 5.3529
base  76000 scale 0.5 [1] 5.3532
base  78000 scale 0.5 [1] 5.3573
base  80000 scale 0.5 [1] 5.3710
base  84000 scale 0.5 [1] 5.4351
base 100000 scale 0.5 [1] 5.6484
base 120000 scale 0.5 [1] 5.7999

The chunk count decreases as the context is enlarged, which might be the cause of some perplexity issues, but obviously not here.


At 20k the chunk count decreases to 16.

20k
base  68000 scale 0.4 [1] 5.7306
base  70000 scale 0.4 [1] 5.7083
base  72000 scale 0.4 [1] 5.7550
base  11000 scale 0.4 [1] 6.2897
base 150000 scale 0.4 [1] 6.6441
base 100000 scale 0.5 [1] 5.7545
base 110000 scale 0.5 [1] 5.7393
base 120000 scale 0.5 [1] 5.8566

32k

I believe the 13B MEM_REQ_EVAL is not enough to test it 🤷

@FNsi (Contributor) commented Jul 2, 2023

Running perplexity on OpenLLaMA 3B with -c 16384, scale 0.5, base 90000:

not enough space in the context's memory pool (needed 543758112, available 536870912)

13B with -c 32768, scale 0.25, base 120000:
segmentation fault (needed 108054424, available 1073741824)

@SlyEcho (Collaborator) commented Jul 2, 2023

KV cache alone requires 25GB!!

Could we quantize the KV cache?

@FNsi (Contributor) commented Jul 2, 2023

KV cache alone requires 25GB!!

Could we quantize the KV cache?

Another solution #1955


Btw I just saw
falcon v split

@Green-Sky (Collaborator) commented Jul 2, 2023

KV cache alone requires 25GB!!

Could we quantize the KV cache?

I think this was tried and resulted in bad results. It should already be in f16,
but I don't remember if we tried 8-bit quantization...

Edit: do we use FlashAttention for the forward pass?

@ardfork (Contributor) commented Jul 2, 2023

For some reason, the server example outputs some random Unicode characters when using --rope-freq-scale 0.25 -c 8192, but --rope-freq-scale 0.25 -c 4096 works correctly, and --rope-freq-scale 0.25 -c 8192 works on the CLI.

@jxy (Contributor, Author) commented Jul 3, 2023

server gives me a 413 when the JSON data is large. We need help from those who contributed the server code.

@SlyEcho (Collaborator) commented Jul 3, 2023

server gives me a 413 when the JSON data is large. We need help from those who contributed the server code.

I believe CPPHTTPLIB_RECV_BUFSIZ needs to be increased, right now it is 4K.

@digiwombat (Contributor)

Yeah, SlyEcho is right. Based on what I saw in the lib, setting
#define CPPHTTPLIB_RECV_BUFSIZ size_t(<SOME NUMBER HERE>)
before the httplib.h include should be the correct way to increase it, I believe.
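
A minimal sketch of that suggestion (the value 65536 is an arbitrary example, not a recommendation):

// The define must appear before httplib.h is included,
// otherwise the library's built-in default (4K, per the comment above) is used.
#define CPPHTTPLIB_RECV_BUFSIZ size_t(65536u)
#include "httplib.h"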

@jxy (Contributor, Author) commented Jul 3, 2023

Running perplexity on OpenLLaMA 3B with -c 16384, scale 0.5, base 90000:

not enough space in the context's memory pool (needed 543758112, available 536870912)

The only thing I know of that allocates 512 MB (536870912) is MEM_REQ_EVAL, which this PR didn't change. Maybe try changing the line

{ MODEL_3B, 512ull * MB },

to something like

{ MODEL_3B, 600ull * MB },

and see if it helps?

@jxy (Contributor, Author) commented Jul 4, 2023

I believe CPPHTTPLIB_RECV_BUFSIZ needs to be increased, right now it is 4K.

It looks like a simple read buffer to me, and it's separate from the overall size limit.

@digiwombat (Contributor)

It looks like a simple read buffer to me, and it's separate from the overall size limit.

Server::set_payload_max_length(uint64_t length) might be what we're after then.

svr.set_payload_max_length(1024 * 1024 * 1); would set it to 1MB (left the 1 in for example purposes)
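
As a minimal sketch of where that call would sit in an httplib-based server (the endpoint path and the 32 MiB limit are placeholders, not the actual server example code):

#include <string>
#include "httplib.h"

int main() {
    httplib::Server svr;

    // allow request bodies up to 32 MiB instead of relying on the default
    svr.set_payload_max_length(32 * 1024 * 1024);

    // hypothetical endpoint just to show where the setting applies
    svr.Post("/completion", [](const httplib::Request & req, httplib::Response & res) {
        res.set_content("received " + std::to_string(req.body.size()) + " bytes", "text/plain");
    });

    svr.listen("127.0.0.1", 8080);
    return 0;
}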

@jxy (Contributor, Author) commented Jul 4, 2023

It looks like a simple read buffer to me, and it's separate from the overall size limit.

Server::set_payload_max_length(uint64_t length) might be what we're after then.

svr.set_payload_max_length(1024 * 1024 * 1); would set it to 1MB (left the 1 in for example purposes)

The default is actually

#define CPPHTTPLIB_PAYLOAD_MAX_LENGTH ((std::numeric_limits<size_t>::max)())

@ggerganov (Owner)

What is the latest state of this approach - is it worth merging and supporting?

@time-less-ness

I've been using this on a Mac M1 Max since the PR was raised and it's working fine for me. I've been hoping it will get merged so I can go back to compiling from master again. Really enjoying having 8k context.

@SlyEcho (Collaborator) commented Jul 14, 2023

Let's merge and maybe then improve later.

@ggerganov merged commit 6e7cca4 into ggerganov:master on Jul 15, 2023. 24 checks passed.

@@ -15759,7 +15759,7 @@ static void ggml_compute_backward(struct ggml_context * ctx, struct ggml_tensor
     {
         if (src0->grad) {
             assert(src1->type == GGML_TYPE_I32);
-            assert(ggml_nelements(src1) == 4);
+            assert(ggml_nelements(src1) == 3);
A Collaborator commented on this diff:

Shouldn't this be 6? Based on the code immediately after it should be at least 4, I think, not 3.

Another reviewer commented:

I went through the code and I also can't see why it's 3 when the lines just below it clearly take 4 elements; it looks like it is designed to fail the assertion.


@ggerganov (Owner) replied:

Should be fixed in 513f861

@LostRuins (Collaborator)

Is there any reason why the following lines are unmodified and still use the hardcoded 10000.0 and 1.0 rope frequency and scale?

https://github.com/ggerganov/llama.cpp/blob/master/ggml-cuda.cu#L2955
https://github.com/ggerganov/llama.cpp/blob/master/ggml.c#L12418
https://github.com/ggerganov/llama.cpp/blob/master/ggml.c#L12517

@Green-Sky (Collaborator)

Is there any reason why the following lines are unmodified and still use the hardcoded 10000.0 and 1.0 rope frequency and scale?

For the ggml.c lines, those appear to be the RoPE backward passes, confusingly named forward_rope_back.

@jxy (Contributor, Author) commented Jul 18, 2023

I left the backward code untouched because I wasn't sure how I could correctly modify it and test it.

I'm also not sure about the CUDA bits.

@SlyEcho (Collaborator) commented Jul 18, 2023

The CUDA part is broken right now, it should be fixed.

@bilal-aamer

How do I implement this with RoPE and without it with current LLMs?

@abc-nix commented Jan 21, 2024

How do I implement this with RoPE and without it with current LLMs?

You can read a bit more about RoPE use in llama.cpp in the llama.cpp/examples/main/README.md

Though I would recommend you try out the new Self-Extend support added in #4815, which I think is better, as you don't need to retrain the model to get better results.

@bilal-aamer

How do I implement this with RoPE and without it with current LLMs?

You can read a bit more about RoPE use in llama.cpp in the llama.cpp/examples/main/README.md

Though I would recommend you try out the new Self-Extend support added in #4815, which I think is better, as you don't need to retrain the model to get better results.

Thanks @abc-nix!

What about the implementation of customized RoPE?

@abc-nix commented Jan 21, 2024

Sorry, @bilal-aamer, I am not sure what you are trying to ask here.

This PR adds customized RoPE support. Later, YaRN RoPE scaling was added in PR #2268, and some other fixes were added after that.

main's help has this to say about the options and parameters that make use of RoPE/YaRN:

  --rope-scaling {none,linear,yarn}
                        RoPE frequency scaling method, defaults to linear unless specified by the model
  --rope-scale N        RoPE context scaling factor, expands context by a factor of N
  --rope-freq-base N    RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
  --rope-freq-scale N   RoPE frequency scaling factor, expands context by a factor of 1/N
  --yarn-orig-ctx N     YaRN: original context size of model (default: 0 = model training context size)
  --yarn-ext-factor N   YaRN: extrapolation mix factor (default: 1.0, 0.0 = full interpolation)
  --yarn-attn-factor N  YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
  --yarn-beta-slow N    YaRN: high correction dim or alpha (default: 1.0)
  --yarn-beta-fast N    YaRN: low correction dim or beta (default: 32.0)
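
For example, to double the context of a model trained at 2048 with plain linear scaling, an invocation could look like this (the model path and prompt are placeholders):

./main -m models/llama-2-13b.Q4_K_M.gguf -c 4096 \
       --rope-scaling linear --rope-freq-scale 0.5 \
       -p "Once upon a time"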

I am not sure what you are trying to achieve or what exactly you are asking. Hopefully someone else isn't as obtuse as me and can help you out.

@madiarabis commented Mar 29, 2024

Is there any documentation on how to implement this, or an example? I am kind of new to the field; I am fine-tuning Code Llama 2 and I want to increase the context length, but between all these posts I am somewhat confused about how to actually implement it.

This is my implementation:
accelerate launch --config_file "./fsdp_config.yaml" fsdp_acc2.py \
    --rope_scaling 0.25

This is the error I am getting:
RuntimeError: The size of tensor a (16384) must match the size of tensor b (16385) at non-singleton dimension

@jxy deleted the custom_rope branch on April 10, 2024 at 02:45.