Skip computation of much of last layer & unused logits during prompt eval / large N #2700

Closed
wants to merge 15 commits

Conversation

ochafik
Collaborator

@ochafik ochafik commented Aug 22, 2023

Bit-rotten since multi-model support landed, and superseded by #6122 🎉

TL;DR: the speedup is modest but consistent on CPU (~4% for a 7B model, 15-25% for TinyLlamas models on an M2 Mac) for a rather simple change.

During prompt eval (w/ N > 1), a lot of the work done at the last layer is wasted unless we want all the logits out: it's fine to focus on the last token once that layer has finished writing to the KV cache.

Concretely, this means we can drop all but the last column of cur and inpSA after writing to the KV cache at the last layer, unless logits_all is set (I've moved tmpq & Qcur to right after that narrowing reshape).
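
Roughly, the narrowing looks like this (simplified sketch; the actual views and their CUDA offloading are discussed in the review thread below):

// At the last layer, once the KV cache has been written, keep only the
// last token's column of the [n_embd, N] activations.
if (il == n_layer - 1 && !lctx.logits_all) {
    cur   = ggml_view_2d(ctx0, cur,   n_embd, 1,   cur->nb[1], (N - 1)*ggml_element_size(cur)*n_embd);
    inpSA = ggml_view_2d(ctx0, inpSA, n_embd, 1, inpSA->nb[1], (N - 1)*ggml_element_size(inpSA)*n_embd);
}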

Some notes on performance:

  • Squeezing eval time will only make a difference when the prompt is much larger than the generation.
  • The more layers, the smaller the benefit
  • Inconclusive results on CUDA, not sure how to get stable results w/o dedicated hardware (tried on this Colab)
  • This won't make perplexity computation any faster (all logits are needed there)
  • Technically could skip more operations when the eval is chunked in batches (since only the last logit of the last chunk is needed), but there's vastly diminishing returns here

To test this branch:

git clone https://github.com/ochafik/llama.cpp ochafik-llama.cpp
cd ochafik-llama.cpp
git checkout skip-unused-2 && git pull

cmake -B ../build-skip . && ( cd ../build-skip && make -j main ) 
cmake -B ../build-noskip . -DLLAMA_SKIP_UNUSED_LOGITS=0  && ( cd ../build-noskip && make -j main ) 

# Edit to point to your model's path
hyperfine --warmup 1 \
  -L build noskip,skip \
  '../build-{build}/bin/main -m llama-2-7b-chat.ggmlv3.q2_K.bin --temp 0 -n 1 -f prompts/reason-act.txt -ngl 0'
# set -ngl 1 for Metal / CUDA

@KerfuffleV2
Collaborator

This doesn't seem compatible with #1087 (the ROCM port). I just get garbage output trying to evaluate a model. Doesn't seem to affect the perplexity tool (as expected).

I tested with CPU only and CLBlast - seems to work fine with those. (Only tested evaluation.)

@KerfuffleV2
Collaborator

Is there something specific with the ROCM stuff that just makes it impossible to use this approach? I was just mentioning it, wasn't sure if the issue was with this pull or the ROCM one or whether there were changes that could fix it.

@ochafik
Collaborator Author

ochafik commented Aug 23, 2023

@KerfuffleV2 Thanks a lot for testing it out! I've now disabled this when LLAMA_USE_HIPBLAS is defined, but after glancing through that PR I don't see any reason why it wouldn't work, although my understanding of anything CUDA- or ROCm-related is very limited.

@ggerganov Do you have existing tricks to compare intermediate outputs between backends? (Some way to force synchronous evaluation of each operation, maybe, e.g. with its own 1-node graph & compute call; or maybe appending each tensor to a binary file, then reading / comparing it on the fly?) That would make it possible to spot at which exact op garbage starts appearing (and could reveal an underlying bug in the ROCm code if it's the only backend causing issues).
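
For the binary-file idea, I'm imagining something as simple as the following (a hypothetical dump_tensor helper, not an existing ggml API; it assumes the tensor data is resident on the host, so GPU-resident tensors would need to be copied back first):

#include <cstdio>
#include "ggml.h"

// Hypothetical helper: append a tensor's name, shape and raw data to a file
// so that two backends' runs can be diffed offline, op by op.
static void dump_tensor(FILE * f, const struct ggml_tensor * t) {
    fwrite(t->name, 1, sizeof(t->name), f);   // fixed-size name buffer
    fwrite(t->ne,   sizeof(t->ne[0]), 4, f);  // shape (4 dims)
    fwrite(t->data, 1, ggml_nbytes(t), f);    // raw contents (host memory only)
}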

If there's interest in shipping this PR (which is arguably borderline useful given low single-digit speedup) I'll try and rent a cloud ROCm instance to debug 🤗

@KerfuffleV2
Collaborator

@SlyEcho Any idea of what could be going on here? #2700 (comment)

@SlyEcho
Sponsor Collaborator

SlyEcho commented Aug 24, 2023

Should not be related to AMD, will have to test CUDA as well.

I suspect something about the shape of the tensors is not suited to our optimized kernels.

@Engininja2
Contributor

I tried the Colab with some -ngl values smaller than 33 and they produce garbage output.

@ardfork
Contributor

ardfork commented Aug 24, 2023

Yes, I wanted to check what the problem was on ROCm, but it's the same: fully offloading works, but offloading only a few layers doesn't.

@ochafik
Collaborator Author

ochafik commented Aug 25, 2023

@Engininja2 @ardfork Thanks, great to be able to repro the issue!

@KerfuffleV2 @SlyEcho This PR is definitely broken on CUDA, independently of the ROCm changes, when not all layers are offloaded to the GPU. I don't understand the offload_func_* logic yet; I reckon I'll need to update it.

@SlyEcho
Sponsor Collaborator

SlyEcho commented Aug 25, 2023

I suspect it's got something to do with copying memory between device and host.
I think @JohannesGaessler and @slaren would know better.

@ochafik
Collaborator Author

ochafik commented Aug 25, 2023

Pushed a naive tweak that seems to fix the output for all ngl values... except -ngl 1 😓. I'll keep digging but might have to disable this for CUDA altogether.

Also @SlyEcho I've moved the define to the makefile / CMakeLists.txt as you suggested.

llama.cpp Outdated
Comment on lines 2324 to 2325
if (il == n_layer - 1 && !lctx.logits_all)
{
Collaborator

This does not follow the coding guidelines in the README.

Collaborator Author

Thanks, fixed.

llama.cpp Outdated
offload_func_t offload_func_nr = llama_nop; // nr = non-repeating
offload_func_t offload_func_kq = llama_nop;
offload_func_t offload_func_v = llama_nop;
offload_func_t offload_func_skip = llama_nop;
Collaborator

Why are you defining offload_func_skip? If I understand your code correctly it works by discarding the unneeded parts of tensors in the last layer. I don't understand why this would affect the offloading logic, i.e. whether data should be stored in RAM or VRAM.

Collaborator Author

I was hoping that making offload_func_skip distinct would help, but there seems to be some entanglement anyway.

Not sure I understand the interplay between ggml_cuda_compute_forward and the CPU fallback route, but I've noticed the former skips GGML_OP_VIEW if its input isn't on the GPU, so I've now tried to make sure the offload of cur's and inpSA's subviews matches whatever backend they were currently on, using the following defensive code, and... it seems to fix the issue:

auto cur_on_gpu = cur->backend == GGML_BACKEND_GPU;
cur   = ggml_view_2d(ctx0, cur,   n_embd, 1,   cur->nb[1], (N - 1)*ggml_element_size(cur)*n_embd);
if (cur_on_gpu) ggml_cuda_assign_buffers_no_alloc(cur);

auto inpSA_on_gpu = inpSA->backend == GGML_BACKEND_GPU;
inpSA = ggml_view_2d(ctx0, inpSA, n_embd, 1, inpSA->nb[1], (N - 1)*ggml_element_size(inpSA)*n_embd);
if (inpSA_on_gpu) ggml_cuda_assign_buffers_no_alloc(inpSA);

The issue is that with -ngl 1, i_gpu_start = n_layer - 1, so the last layer is the only one that offloads most of its tensors (per the if (il >= i_gpu_start) { offload_func = ggml_cuda_assign_buffers_no_alloc; } block).

Another, simpler fix (pushed to this PR) that also seems to work is to offload the view inputs (though that may offload more than intended?):

offload_func(cur);
cur   = ggml_view_2d(ctx0, cur,   n_embd, 1,   cur->nb[1], (N - 1)*ggml_element_size(cur)*n_embd);
offload_func(cur);

offload_func(inpSA);
inpSA = ggml_view_2d(ctx0, inpSA, n_embd, 1, inpSA->nb[1], (N - 1)*ggml_element_size(inpSA)*n_embd);
offload_func(inpSA);

Will try to find nicer ways to fix this (wondering if the CPU view fallback in a mostly GPU context is behaving as expected?), suggestions welcome 🙂

Collaborator

Not sure I understand the interplay between ggml_cuda_compute_forward and the CPU fallback route

The decision for whether or not CUDA should be used is based on data location. If the data of any of the tensors is in VRAM, then all of the calculations are done on the GPU and the data is copied to/from VRAM as necessary, based on backend. My preferred design would have been to add a state to the ggml context that sets the backend for all newly created tensors, but unfortunately this was vetoed by Georgi, who wanted to avoid GPU offloading logic in ggml. So the current design is kind of cancerous, where you have to manually offload each tensor. For tensors that don't actually do any computations this is very awkward, because there is no step where the data would be carried over if necessary. Maybe I should add an explicit check for those tensors to ensure that the backends are consistent.

@JohannesGaessler
Collaborator

I pushed a fix that should have the correct offloading logic here.

@SlyEcho
Sponsor Collaborator

SlyEcho commented Aug 27, 2023

If it works all the time everywhere I don't think there needs to be a compile option. But if it stays it should be enabled by default for most users.

@ochafik
Collaborator Author

ochafik commented Aug 28, 2023

@JohannesGaessler thanks so much for looking into it! I still seem to be getting wrong output with that change though (still only broken for -ngl 1, see colab 😓).

Edit: found the issue, see review thread

@JohannesGaessler
Collaborator

Good catch with -ngl 1, it seems that only that exact value was broken.

I pushed more changes to my own repository. The changes are an extra check in the CUDA code to let you safely offload the same tensor multiple times (which I think can currently happen), and the removal of the first call to offload_func: since the preceding tensor is in the same layer, it should always already be offloaded.

@ochafik
Collaborator Author

ochafik commented Aug 30, 2023

@JohannesGaessler thanks a lot, I've merged your changes! (sorry for the delay, faced some mysterious segfaults on Colab, seems to have been fixed by merging master again 😅)

Makefile Outdated
@@ -441,6 +441,15 @@ k_quants.o: k_quants.c k_quants.h
$(CC) $(CFLAGS) -c $< -o $@
endif # LLAMA_NO_K_QUANTS

ifndef LLAMA_NO_SKIP_UNUSED_LOGITS
CFLAGS += -DLLAMA_SKIP_UNUSED_LOGITS
Collaborator

You can't put indented assignments below a build rule like this. The ifndef doesn't make a difference syntactically; this line is treated as part of the k_quants.o build.

Collaborator Author

Ah thanks, I'd overlooked 4dcd47d (and hadn't merged it well), fixed.

@ggerganov added the demo label (Demonstrate some concept or idea, not intended to be merged) on Sep 14, 2023
@ochafik
Collaborator Author

ochafik commented Sep 21, 2023

Updated the description w/ CPU speedup figures for TinyLlamas models (15-25% faster)

Labels
demo (Demonstrate some concept or idea, not intended to be merged)

8 participants