Skip computation of much of last layer & unused logits during prompt eval / large N #2700

Closed
wants to merge 15 commits

Conversation

ochafik
Collaborator

@ochafik ochafik commented Aug 22, 2023

Bit-rotten since multi-model support landed, and superseded by #6122 🎉

TL;DR: the speedup is modest but consistent on CPU (~4% for a 7B model, 15-25% for TinyLlamas models on an M2 Mac) for a rather simple change.

During prompt eval (w/ N > 1), a lot of the work done at the last layer is wasted unless we want all the logits out: it's fine to focus on the last token once that layer has finished writing to the KV cache.

Concretely, this means we can drop all but the last column of cur and inpSA after writing to the KV cache at the last layer, unless logits_all is set (I've moved tmpq & Qcur to right after that narrowing reshape).
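
Roughly, the narrowing looks like this (simplified sketch; the actual views and their CUDA offloading are discussed in the review thread below):

// At the last layer, once the KV cache has been written, keep only the
// last token's column of the [n_embd, N] activations.
if (il == n_layer - 1 && !lctx.logits_all) {
    cur   = ggml_view_2d(ctx0, cur,   n_embd, 1,   cur->nb[1], (N - 1)*ggml_element_size(cur)*n_embd);
    inpSA = ggml_view_2d(ctx0, inpSA, n_embd, 1, inpSA->nb[1], (N - 1)*ggml_element_size(inpSA)*n_embd);
}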

Some notes on performance:

  • Squeezing eval time will only make a difference when the prompt is much larger than the generation.
  • The more layers, the smaller the benefit
  • Inconclusive results on CUDA, not sure how to get stable results w/o dedicated hardware (tried on this Colab)
  • This won't make perplexity computation any faster (all logits are needed there)
  • Technically could skip more operations when the eval is chunked in batches (since only the last logit of the last chunk is needed), but there's vastly diminishing returns here

To test this branch:

git clone https://github.com/ochafik/llama.cpp ochafik-llama.cpp
cd ochafik-llama.cpp
git checkout skip-unused-2 && git pull

cmake -B ../build-skip . && ( cd ../build-skip && make -j main ) 
cmake -B ../build-noskip . -DLLAMA_SKIP_UNUSED_LOGITS=0  && ( cd ../build-noskip && make -j main ) 

# Edit to point to your model's path
hyperfine --warmup 1 \
  -L build noskip,skip \
  '../build-{build}/bin/main -m llama-2-7b-chat.ggmlv3.q2_K.bin --temp 0 -n 1 -f prompts/reason-act.txt -ngl 0'
# set -ngl 1 for Metal / CUDA

@KerfuffleV2
Collaborator

This doesn't seem compatible with #1087 (the ROCM port). I just get garbage output trying to evaluate a model. Doesn't seem to affect the perplexity tool (as expected).

I tested with CPU only and CLBlast - seems to work fine with those. (Only tested evaluation.)

@KerfuffleV2
Collaborator

Is there something specific with the ROCM stuff that just makes it impossible to use this approach? I was just mentioning it, wasn't sure if the issue was with this pull or the ROCM one or whether there were changes that could fix it.

@ochafik
Collaborator Author

ochafik commented Aug 23, 2023

@KerfuffleV2 Thanks a lot for testing it out! I've now disabled this when LLAMA_USE_HIPBLAS is defined, but after glancing through that PR I don't see any reason why it wouldn't work, although my understanding of anything CUDA- or ROCm-related is very limited.

@ggerganov Do you have existing tricks to compare intermediate outputs between backends? (Some way to force synchronous evaluation of each operation, maybe, e.g. with its own 1-node graph & compute call; or maybe appending each tensor to a binary file, then reading / comparing it on the fly?) That would make it possible to spot at which exact op garbage starts appearing (and could reveal an underlying bug in the ROCm code if it's the only backend causing issues).
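
For the binary-file idea, I'm imagining something as simple as the following (a hypothetical dump_tensor helper, not an existing ggml API; it assumes the tensor data is resident on the host, so GPU-resident tensors would need to be copied back first):

#include <cstdio>
#include "ggml.h"

// Hypothetical helper: append a tensor's name, shape and raw data to a file
// so that two backends' runs can be diffed offline, op by op.
static void dump_tensor(FILE * f, const struct ggml_tensor * t) {
    fwrite(t->name, 1, sizeof(t->name), f);   // fixed-size name buffer
    fwrite(t->ne,   sizeof(t->ne[0]), 4, f);  // shape (4 dims)
    fwrite(t->data, 1, ggml_nbytes(t), f);    // raw contents (host memory only)
}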

If there's interest in shipping this PR (which is arguably borderline useful given low single-digit speedup) I'll try and rent a cloud ROCm instance to debug 🤗

@KerfuffleV2
Collaborator

@SlyEcho Any idea of what could be going on here? #2700 (comment)

@SlyEcho
Sponsor Collaborator

SlyEcho commented Aug 24, 2023

Should not be related to AMD, will have to test CUDA as well.

I suspect something about the shape of the tensors is not suited to our optimized kernels.

@Engininja2
Contributor

I tried the Colab with some -ngl values smaller than 33 and they produce garbage output.

@ardfork
Contributor

ardfork commented Aug 24, 2023

Yes, I wanted to check what the problem was on ROCm, but it's the same: fully offloading works, but offloading only a few layers doesn't.

@ochafik
Collaborator Author

ochafik commented Aug 25, 2023

@Engininja2 @ardfork Thanks, great to be able to repro the issue!

@KerfuffleV2 @SlyEcho This PR is definitely broken on CUDA, independently of the ROCm changes, when not all layers are offloaded to the GPU. I don't understand the offload_func_* logic yet; I reckon I'll need to update it.

@SlyEcho
Sponsor Collaborator

SlyEcho commented Aug 25, 2023

I suspect it's got something to do with copying memory between device and host.
I think @JohannesGaessler and @slaren would know better.

@ochafik
Collaborator Author

ochafik commented Aug 25, 2023

Pushed a naive tweak that seems to fix the output for all ngl values... except -ngl 1 😓. I'll keep digging but might have to disable this for CUDA altogether.

Also @SlyEcho I've moved the define to the makefile / CMakeLists.txt as you suggested.

llama.cpp Outdated
Comment on lines 2324 to 2325
if (il == n_layer - 1 && !lctx.logits_all)
{
Collaborator

This does not follow the coding guidelines in the README.

Collaborator Author

Thanks, fixed.

llama.cpp Outdated
offload_func_t offload_func_nr = llama_nop; // nr = non-repeating
offload_func_t offload_func_kq = llama_nop;
offload_func_t offload_func_v = llama_nop;
offload_func_t offload_func_skip = llama_nop;
Collaborator

Why are you defining offload_func_skip? If I understand your code correctly it works by discarding the unneeded parts of tensors in the last layer. I don't understand why this would affect the offloading logic, i.e. whether data should be stored in RAM or VRAM.

Collaborator Author

I was hoping that making offload_func_skip distinct would help, but there seems to be some entanglement anyway.

Not sure I understand the interplay between ggml_cuda_compute_forward and the CPU fallback route, but I've noticed the former skips GGML_OP_VIEW if its input isn't on the GPU, so I've now tried to make sure the offload of cur's and inpSA's subviews matches whatever backend they were currently on, using the following defensive code, and... it seems to fix the issue:

auto cur_on_gpu = cur->backend == GGML_BACKEND_GPU;
cur   = ggml_view_2d(ctx0, cur,   n_embd, 1,   cur->nb[1], (N - 1)*ggml_element_size(cur)*n_embd);
if (cur_on_gpu) ggml_cuda_assign_buffers_no_alloc(cur);

auto inpSA_on_gpu = inpSA->backend == GGML_BACKEND_GPU;
inpSA = ggml_view_2d(ctx0, inpSA, n_embd, 1, inpSA->nb[1], (N - 1)*ggml_element_size(inpSA)*n_embd);
if (inpSA_on_gpu) ggml_cuda_assign_buffers_no_alloc(inpSA);

The issue is that with -ngl 1, i_gpu_start = n_layer - 1, so the last layer is the only one that offloads most of its tensors (per the if (il >= i_gpu_start) { offload_func = ggml_cuda_assign_buffers_no_alloc; } block).

Another, simpler fix (pushed to this PR) that also seems to work is to offload the view inputs (though that may offload more than intended?):

offload_func(cur);
cur   = ggml_view_2d(ctx0, cur,   n_embd, 1,   cur->nb[1], (N - 1)*ggml_element_size(cur)*n_embd);
offload_func(cur);

offload_func(inpSA);
inpSA = ggml_view_2d(ctx0, inpSA, n_embd, 1, inpSA->nb[1], (N - 1)*ggml_element_size(inpSA)*n_embd);
offload_func(inpSA);

Will try to find nicer ways to fix this (wondering if the CPU view fallback in a mostly GPU context is behaving as expected?), suggestions welcome 🙂

Collaborator

Not sure I understand the interplay between ggml_cuda_compute_forward and the CPU fallback route

The decision for whether or not CUDA should be used is based on data location. If the data of any of the tensors is in VRAM, then all of the calculations are done on the GPU and the data is copied to/from VRAM as necessary, based on backend. My preferred design would have been to add a state to the ggml context that sets the backend for all newly created tensors, but unfortunately this was vetoed by Georgi, who wanted to avoid GPU offloading logic in ggml. So the current design is kind of cancerous, where you have to manually offload each tensor. For tensors that don't actually do any computations this is very awkward, because there is no step where the data would be carried over if necessary. Maybe I should add an explicit check for those tensors to ensure that the backends are consistent.

@JohannesGaessler
Collaborator

I pushed a fix that should have the correct offloading logic here.

@SlyEcho
Sponsor Collaborator

SlyEcho commented Aug 27, 2023

If it works all the time everywhere I don't think there needs to be a compile option. But if it stays it should be enabled by default for most users.

@ochafik
Collaborator Author

ochafik commented Aug 28, 2023

@JohannesGaessler thanks so much for looking into it! I still seem to be getting wrong output with that change though (still only broken for -ngl 1, see colab 😓).

Edit: found the issue, see review thread

@JohannesGaessler
Collaborator

Good catch with -ngl 1, it seems that only that exact value was broken.

I pushed more changes to my own repository. The changes are an extra check in the CUDA code to let you safely offload the same tensor multiple times (which I think can currently happen), and the removal of the first call to offload_func: since the preceding tensor is in the same layer, it should always already be offloaded.

@ochafik
Collaborator Author

ochafik commented Aug 30, 2023

@JohannesGaessler thanks a lot, I've merged your changes! (sorry for the delay, faced some mysterious segfaults on Colab, seems to have been fixed by merging master again 😅)

Makefile Outdated
@@ -441,6 +441,15 @@ k_quants.o: k_quants.c k_quants.h
$(CC) $(CFLAGS) -c $< -o $@
endif # LLAMA_NO_K_QUANTS

ifndef LLAMA_NO_SKIP_UNUSED_LOGITS
CFLAGS += -DLLAMA_SKIP_UNUSED_LOGITS
Collaborator

You can't put indented assignments below a build rule like this. The ifndef doesn't make a difference syntactically; this line is treated as part of the k_quants.o build.

Collaborator Author

Ah thanks, I'd overlooked 4dcd47d (and hadn't merged it well), fixed.

@ggerganov added the demo label (Demonstrate some concept or idea, not intended to be merged) on Sep 14, 2023
@ochafik
Collaborator Author

ochafik commented Sep 21, 2023

Updated the description w/ CPU speedup figures for TinyLlamas models (15-25% faster)

Labels
demo (Demonstrate some concept or idea, not intended to be merged)

8 participants