Skip computation of much of last layer & unused logits during prompt eval / large N #2700
Closed
Commits (15 total; the diff below shows changes from 5 of them):

- 2cf4f62 Skip computation of unused logits during batch prompt eval (drop othe… (ochafik)
- 7ec7ef9 skip-unused: disable skipping on ROCm / when LLAMA_USE_HIPBLAS (ochafik)
- 5553820 Allow disabling unused logit skipping code w/ cmake / make options (ochafik)
- 3be6e8d Tweak GPU offload when skipping unused logits computations (ochafik)
- e23fa92 Merge branch 'master' into skip-unused-2 (ochafik)
- 21df40d fix offloading logic (JohannesGaessler)
- 2eaeb7e skip-unused: fix brackets & tabs (ochafik)
- f6a446e skip-unused: revert extra spaces (ochafik)
- 9f5b781 skip-unused: fix -ngl=1 case by ensure input & of view are offloaded … (ochafik)
- e9e8ac4 Fix multiple offloading (JohannesGaessler)
- d547e05 Merge remote-tracking branch 'origin/master' into skip-unused-2 (ochafik)
- 4974f37 Merge remote-tracking branch 'JohannesGaessler/skip-unused-2' into sk… (ochafik)
- 646caf9 Merge remote-tracking branch 'origin/master' into skip-unused-2 (ochafik)
- 55d49b2 Merge remote-tracking branch 'origin/master' into skip-unused-2 (ochafik)
- 58bb7d5 Makefile: move unused logits flags where they don't interfere w/ targets (ochafik)
```diff
@@ -2156,7 +2156,8 @@ static struct ggml_cgraph * llm_build_llama(
 
     GGML_ASSERT((!tokens && embd) || (tokens && !embd)); // NOLINT
 
-    const int N = n_tokens;
+    // Non-const to allow short-circuiting to the last token in the last layer in prompt eval mode.
+    int N = n_tokens;
 
     const auto & model = lctx.model;
     const auto & hparams = model.hparams;
```
```diff
@@ -2229,9 +2230,10 @@ static struct ggml_cgraph * llm_build_llama(
     //
     // with the low VRAM option VRAM scratch is disabled in llama_load_model_internal
     // in that case ggml_cuda_assign_buffers has no effect
-    offload_func_t offload_func_nr = llama_nop; // nr = non-repeating
-    offload_func_t offload_func_kq = llama_nop;
-    offload_func_t offload_func_v  = llama_nop;
+    offload_func_t offload_func_nr   = llama_nop; // nr = non-repeating
+    offload_func_t offload_func_kq   = llama_nop;
+    offload_func_t offload_func_v    = llama_nop;
+    offload_func_t offload_func_skip = llama_nop;
 
 #ifdef GGML_USE_CUBLAS
     if (n_gpu_layers > n_layer) {
```
```diff
@@ -2243,6 +2245,9 @@ static struct ggml_cgraph * llm_build_llama(
     if (n_gpu_layers > n_layer + 2) {
         offload_func_kq = ggml_cuda_assign_buffers_no_alloc;
     }
+    if (n_gpu_layers > 0) {
+        offload_func_skip = ggml_cuda_assign_buffers_no_alloc;
+    }
 #endif // GGML_USE_CUBLAS
 
     struct ggml_tensor * KQ_scale = ggml_new_tensor_1d(ctx0, GGML_TYPE_F32, 1);
```
```diff
@@ -2284,18 +2289,10 @@ static struct ggml_cgraph * llm_build_llama(
         offload_func_kq(tmpk);
         ggml_set_name(tmpk, "tmpk");
 
-        struct ggml_tensor * tmpq = ggml_mul_mat(ctx0, model.layers[il].wq, cur);
-        offload_func_kq(tmpq);
-        ggml_set_name(tmpq, "tmpq");
-
         struct ggml_tensor * Kcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd_head, n_head_kv, N), n_past, n_embd_head, 0, 0, freq_base, freq_scale);
         offload_func_kq(Kcur);
         ggml_set_name(Kcur, "Kcur");
 
-        struct ggml_tensor * Qcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpq, n_embd_head, n_head, N), n_past, n_embd_head, 0, 0, freq_base, freq_scale);
-        offload_func_kq(Qcur);
-        ggml_set_name(Qcur, "Qcur");
-
         // store key and value to memory
         {
             // compute the transposed [N, n_embd] V matrix
```
```diff
@@ -2323,6 +2320,37 @@ static struct ggml_cgraph * llm_build_llama(
             ggml_build_forward_expand(gf, ggml_cpy(ctx0, Vcur, v));
         }
 
+#ifdef LLAMA_SKIP_UNUSED_LOGITS
+        if (il == n_layer - 1 && !lctx.logits_all)
+        {
+            // From here on, we only care about the last token and its logits.
+            // We do as if N = 1 (from the end), which means we only keep
+            // the last column of cur and inpSA ((n_embd, N) -> (n_embd, 1)).
+            //
+            // Note that we do this even when N==1 so that we don't change the # nodes in the graph,
+            // otherwise for Metal we'd have to rebuild the concurrency list.
+
+            cur = ggml_view_2d(ctx0, cur, n_embd, 1, cur->nb[1], (N - 1)*ggml_element_size(cur)*n_embd);
+            offload_func_skip(cur);
+            ggml_set_name(cur, "cur-lastpos");
+
+            inpSA = ggml_view_2d(ctx0, inpSA, n_embd, 1, inpSA->nb[1], (N - 1)*ggml_element_size(inpSA)*n_embd);
+            offload_func_skip(inpSA);
+            ggml_set_name(inpSA, "inpSA-lastpos");
+
+            n_past += N - 1;
+            N = 1;
+        }
+#endif // LLAMA_SKIP_UNUSED_LOGITS
+
+        struct ggml_tensor * tmpq = ggml_mul_mat(ctx0, model.layers[il].wq, cur);
+        offload_func_kq(tmpq);
+        ggml_set_name(tmpq, "tmpq");
+
+        struct ggml_tensor * Qcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpq, n_embd_head, n_head, N), n_past, n_embd_head, 0, 0, freq_base, freq_scale);
+        offload_func_kq(Qcur);
+        ggml_set_name(Qcur, "Qcur");
+
         struct ggml_tensor * Q = ggml_permute(ctx0, Qcur, 0, 2, 1, 3);
         offload_func_kq(Q);
         ggml_set_name(Q, "Q");
```

A review comment on the `if (il == n_layer - 1 && !lctx.logits_all)` block: "This does not follow the coding guidelines in the README." Reply: "Thanks, fixed."
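A note on the view trick in this hunk: ggml stores an F32 tensor of shape (n_embd, N) as one contiguous column of n_embd floats per token, so the last token's column begins (N - 1) * n_embd elements into the buffer, which is exactly the byte offset passed to `ggml_view_2d` above. Here is a minimal stand-alone illustration of that offset arithmetic (plain C++ for illustration, not code from the PR):

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    const size_t n_embd = 4; // embedding size (tiny, for illustration)
    const size_t N      = 3; // number of tokens in the batch

    // One contiguous column of n_embd floats per token, mirroring the
    // layout of a ggml (n_embd, N) F32 tensor.
    std::vector<float> cur(n_embd * N);
    for (size_t i = 0; i < cur.size(); i++) {
        cur[i] = (float) i;
    }

    // Same offset the PR passes to ggml_view_2d, expressed in elements
    // rather than bytes: (N - 1) * n_embd.
    const float * last = cur.data() + (N - 1) * n_embd;

    // Only this column needs to flow through the rest of the last layer
    // when logits for the earlier tokens are unused.
    for (size_t i = 0; i < n_embd; i++) {
        std::printf("%.1f ", last[i]); // prints 8.0 9.0 10.0 11.0
    }
    std::printf("\n");
}
```

Since every op after this point sees an (n_embd, 1) tensor, the Q projection, the attention, the feed-forward block, and the final output matmul of the last layer all shrink from N columns to one, which is where the prompt-eval savings come from.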
```diff
@@ -2936,11 +2964,18 @@ static bool llama_eval_internal(
 
         if (lctx.logits_all) {
             logits_out.resize(n_vocab * N);
+            GGML_ASSERT(ggml_nelements(res) == n_vocab * N);
             memcpy(logits_out.data(), (float *) ggml_get_data(res), sizeof(float)*n_vocab*N);
         } else {
             // return result for just the last token
             logits_out.resize(n_vocab);
+#ifdef LLAMA_SKIP_UNUSED_LOGITS
+            GGML_ASSERT(ggml_nelements(res) == n_vocab);
+            memcpy(logits_out.data(), (float *) ggml_get_data(res), sizeof(float)*n_vocab);
+#else
+            GGML_ASSERT(ggml_nelements(res) == n_vocab * N);
             memcpy(logits_out.data(), (float *) ggml_get_data(res) + (n_vocab*(N-1)), sizeof(float)*n_vocab);
+#endif
         }
     }
 
```
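This hunk is the consumer side of the change: with `LLAMA_SKIP_UNUSED_LOGITS`, the graph's `res` tensor already holds just the last token's `n_vocab` logits, so the copy needs no offset, while the `#else` path still picks the last row out of an `n_vocab * N` buffer. A small self-contained sketch of the two copy paths (plain C++ with made-up buffers, not the PR's code):

```cpp
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const size_t n_vocab = 5; // vocabulary size (tiny, for illustration)
    const size_t N       = 3; // tokens in the batch

    // Without skipping: the graph produces logits for every token.
    std::vector<float> res_full(n_vocab * N);
    for (size_t i = 0; i < res_full.size(); i++) res_full[i] = (float) i;

    // With skipping: the graph already produced only the last token's logits.
    std::vector<float> res_last(res_full.end() - n_vocab, res_full.end());

    std::vector<float> logits_out(n_vocab);

    // #else path: offset to the last row of the full buffer.
    std::memcpy(logits_out.data(), res_full.data() + n_vocab * (N - 1),
                sizeof(float) * n_vocab);
    std::printf("full buffer, last row: %.0f..%.0f\n",
                logits_out.front(), logits_out.back());

    // LLAMA_SKIP_UNUSED_LOGITS path: the buffer is already just one row.
    std::memcpy(logits_out.data(), res_last.data(), sizeof(float) * n_vocab);
    std::printf("reduced buffer:        %.0f..%.0f\n",
                logits_out.front(), logits_out.back());
}
```

Both paths print the same `10..14` row; the difference is that the reduced buffer never computed, stored, or transferred the other `n_vocab * (N - 1)` values.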
Why are you defining `offload_func_skip`? If I understand your code correctly it works by discarding the unneeded parts of tensors in the last layer. I don't understand why this would affect the offloading logic, i.e. whether data should be stored in RAM or VRAM.
I was hoping `offload_func_skip` being distinct would help things, but there seems to be some entangling anyway.

Not sure I understand the interplay between `ggml_cuda_compute_forward` and the CPU fallback route, but I've noticed the former skips GGML_OP_VIEW if its input isn't on the GPU, so I've now tried to make sure the offload of `cur` and `inpSA`'s subviews matches whatever backend they currently were, using the following defensive code, and... it seems to fix the issue:

The issue being: with `-ngl 1`, `i_gpu_start = n_layer - 1`, so the last layer is the only one that offloads most of its tensors (as per the `if (il >= i_gpu_start) { offload_func = ggml_cuda_assign_buffers_no_alloc; }` block).

Another simpler fix (pushed on this PR) that also seems to work is to offload the view inputs (but that may be offloading more than intended?):

Will try to find nicer ways to fix this (wondering if the CPU view fallback in a mostly GPU context is behaving as expected?), suggestions welcome 🙂
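A sketch of what such defensive backend-matching could look like, assuming the ggml of that era (tensors carry a `backend` field with values such as `GGML_BACKEND_CPU`); this is a hypothetical reconstruction, not the snippet from the comment:

```cpp
// Hypothetical reconstruction, not the comment's actual snippet: remember
// the source tensor, take the last-position view, then offload the view
// only if the source's data already lives on the GPU, so that the view's
// backend always matches the data it points into.
struct ggml_tensor * src = cur;
cur = ggml_view_2d(ctx0, cur, n_embd, 1, cur->nb[1],
                   (N - 1)*ggml_element_size(cur)*n_embd);
if (src->backend != GGML_BACKEND_CPU) {
    ggml_cuda_assign_buffers_no_alloc(cur);
}
ggml_set_name(cur, "cur-lastpos");
```

The simpler fix that was actually pushed is visible in the diff above: the views are offloaded through `offload_func_skip`, which is set to `ggml_cuda_assign_buffers_no_alloc` whenever `n_gpu_layers > 0`.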
The decision for whether or not CUDA should be used is based on data location: if the data of any of the tensors is in VRAM, then all of the calculations are done on the GPU and the data is copied to/from VRAM as necessary based on backend. My preferred design would have been to add a state to the ggml context that sets the backend for all newly created tensors, but unfortunately this was vetoed by Georgi, who wanted to avoid GPU offloading logic in ggml. So the current design is kind of cancerous where you have to manually offload each tensor. For tensors that don't actually do any computations this is very awkward, because there is no step where the data would be carried over if necessary. Maybe I should add an explicit check for those tensors to ensure that the backends are consistent.
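To illustrate that last point, such a consistency check only needs to compare a computation-free tensor's backend against its source's. The sketch below is self-contained C++ with hypothetical `Tensor`/`Backend` types standing in for ggml's internals, purely to show the shape of the check:

```cpp
#include <cstdio>

enum class Backend { CPU, GPU };

struct Tensor {
    Backend backend;          // where this tensor's data lives
    Tensor * src0 = nullptr;  // first input, if any
    Tensor * src1 = nullptr;  // second input, if any
};

// Dispatch rule described above: if any input's data lives in VRAM,
// run the op on the GPU, copying other operands over as needed.
static bool should_use_cuda(const Tensor & t) {
    return (t.src0 && t.src0->backend == Backend::GPU) ||
           (t.src1 && t.src1->backend == Backend::GPU);
}

// The awkward case: views do no computation, so nothing ever moves their
// data. An explicit check can catch a view whose backend disagrees with
// its source's.
static void check_view_backend(const Tensor & view) {
    if (view.src0 && view.backend != view.src0->backend) {
        std::printf("warning: view backend differs from its source\n");
    }
}

int main() {
    Tensor weights { Backend::GPU };
    Tensor view    { Backend::CPU, &weights }; // a CPU view of GPU data

    std::printf("use CUDA: %d\n", should_use_cuda(view) ? 1 : 0);
    check_view_backend(view); // fires: the view was never offloaded
}
```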