Skip computation of much of last layer & unused logits during prompt eval / large N #2700

Closed
wants to merge 15 commits into from

Changes from 5 commits
5 changes: 5 additions & 0 deletions CMakeLists.txt
@@ -80,6 +80,7 @@ option(LLAMA_METAL "llama: use Metal"
option(LLAMA_MPI "llama: use MPI" OFF)
option(LLAMA_K_QUANTS "llama: use k-quants" ON)
option(LLAMA_QKK_64 "llama: use super-block size of 64 for k-quants" OFF)
option(LLAMA_SKIP_UNUSED_LOGITS "llama: skip computation of unused logits" ON)

option(LLAMA_BUILD_TESTS "llama: build tests" ${LLAMA_STANDALONE})
option(LLAMA_BUILD_EXAMPLES "llama: build examples" ${LLAMA_STANDALONE})
@@ -390,6 +391,10 @@ if (LLAMA_HIPBLAS)
endif()
endif()

if (LLAMA_SKIP_UNUSED_LOGITS)
add_compile_definitions(LLAMA_SKIP_UNUSED_LOGITS)
endif()

if (LLAMA_ALL_WARNINGS)
if (NOT MSVC)
set(c_flags
5 changes: 5 additions & 0 deletions Makefile
@@ -326,6 +326,11 @@ k_quants.o: k_quants.c k_quants.h
$(CC) $(CFLAGS) -c $< -o $@
endif # LLAMA_NO_K_QUANTS

ifndef LLAMA_NO_SKIP_UNUSED_LOGITS
CFLAGS += -DLLAMA_SKIP_UNUSED_LOGITS
CXXFLAGS += -DLLAMA_SKIP_UNUSED_LOGITS
endif

#
# Print build information
#
59 changes: 47 additions & 12 deletions llama.cpp
@@ -2156,7 +2156,8 @@ static struct ggml_cgraph * llm_build_llama(

GGML_ASSERT((!tokens && embd) || (tokens && !embd)); // NOLINT

const int N = n_tokens;
// Non-const to allow short-circuiting to the last token in the last layer in prompt eval mode.
int N = n_tokens;

const auto & model = lctx.model;
const auto & hparams = model.hparams;
@@ -2229,9 +2230,10 @@ static struct ggml_cgraph * llm_build_llama(
//
// with the low VRAM option VRAM scratch is disabled in llama_load_model_internal
// in that case ggml_cuda_assign_buffers has no effect
offload_func_t offload_func_nr = llama_nop; // nr = non-repeating
offload_func_t offload_func_kq = llama_nop;
offload_func_t offload_func_v = llama_nop;
offload_func_t offload_func_nr = llama_nop; // nr = non-repeating
offload_func_t offload_func_kq = llama_nop;
offload_func_t offload_func_v = llama_nop;
offload_func_t offload_func_skip = llama_nop;
Collaborator

Why are you defining offload_func_skip? If I understand your code correctly it works by discarding the unneeded parts of tensors in the last layer. I don't understand why this would affect the offloading logic, i.e. whether data should be stored in RAM or VRAM.

Collaborator Author

I was hoping that keeping offload_func_skip distinct would help things, but there seems to be some entanglement anyway.

Not sure I understand the interplay between ggml_cuda_compute_forward and the CPU fallback route, but I've noticed the former skips GGML_OP_VIEW if its input isn't on the GPU. So I've now tried to make sure the offload of the cur and inpSA subviews matches whatever backend they are currently on, using the following defensive code, and... it seems to fix the issue:

auto cur_on_gpu = cur->backend == GGML_BACKEND_GPU;
cur   = ggml_view_2d(ctx0, cur,   n_embd, 1,   cur->nb[1], (N - 1)*ggml_element_size(cur)*n_embd);
if (cur_on_gpu) ggml_cuda_assign_buffers_no_alloc(cur);

auto inpSA_on_gpu = inpSA->backend == GGML_BACKEND_GPU;
inpSA = ggml_view_2d(ctx0, inpSA, n_embd, 1, inpSA->nb[1], (N - 1)*ggml_element_size(inpSA)*n_embd);
if (inpSA_on_gpu) ggml_cuda_assign_buffers_no_alloc(inpSA);

The issue is that with -ngl 1, i_gpu_start = n_layer - 1, so the last layer is the only one that offloads most of its tensors (as per the if (il >= i_gpu_start) { offload_func = ggml_cuda_assign_buffers_no_alloc; } block).
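
For reference, a minimal sketch of the offload-selection loop being referred to here (the loop shape and the i_gpu_start computation are assumptions paraphrased from llama.cpp of this era, not lines from this PR):

// Hedged sketch: with -ngl 1 we have n_gpu_layers == 1, so i_gpu_start == n_layer - 1
// and only the last layer picks the CUDA offload function.
const int i_gpu_start = n_layer - n_gpu_layers;
for (int il = 0; il < n_layer; ++il) {
    offload_func_t offload_func = llama_nop;
#ifdef GGML_USE_CUBLAS
    if (il >= i_gpu_start) {
        offload_func = ggml_cuda_assign_buffers_no_alloc;
    }
#endif // GGML_USE_CUBLAS
    // ... build this layer's tensors, calling offload_func(t) on each one ...
}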

Another, simpler fix (pushed to this PR) that also seems to work is to offload the view inputs (though that may offload more than intended?):

offload_func(cur);
cur   = ggml_view_2d(ctx0, cur,   n_embd, 1,   cur->nb[1], (N - 1)*ggml_element_size(cur)*n_embd);
offload_func(cur);

offload_func(inpSA);
inpSA = ggml_view_2d(ctx0, inpSA, n_embd, 1, inpSA->nb[1], (N - 1)*ggml_element_size(inpSA)*n_embd);
offload_func(inpSA);

Will try to find nicer ways to fix this (wondering if the CPU view fallback in a mostly GPU context is behaving as expected?), suggestions welcome 🙂

Collaborator

Not sure I understand the interplay between ggml_cuda_compute_forward and the CPU fallback route

The decision for whether or not CUDA should be used is based on data location. If the data of any of the tensors is in VRAM then all of the calculations are done on the GPU and the data is copied to/from VRAM as necessary based on backend. My preferred design would have been to add a state to the ggml context that sets the backend for all newly created tensors. But unfortunately this was vetoed by Georgi, who wanted to avoid GPU offloading logic in ggml. So the current design is kind of cancerous where you have to manually offload each tensor. For tensors that don't actually do any computations this is very awkward because there is no step where the data would be carried over if necessary. Maybe I should add an explicit check for those tensors to ensure that the backends are consistent.
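
As a rough illustration of the dispatch rule described above, a conceptual sketch follows (this is not the actual ggml-cuda code; the helper name is made up, only the backend field and GGML_BACKEND_GPU are real ggml identifiers):

// Conceptual sketch only: "if any tensor of the op is in VRAM, run the op on the GPU".
static bool op_runs_on_gpu(const struct ggml_tensor * dst,
                           const struct ggml_tensor * src0,
                           const struct ggml_tensor * src1) {
    return dst->backend == GGML_BACKEND_GPU
        || (src0 != NULL && src0->backend == GGML_BACKEND_GPU)
        || (src1 != NULL && src1->backend == GGML_BACKEND_GPU);
    // When true, CPU-resident inputs are copied into VRAM as needed, and the
    // result is copied back out if dst itself lives on the CPU.
}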


#ifdef GGML_USE_CUBLAS
if (n_gpu_layers > n_layer) {
@@ -2243,6 +2245,9 @@ static struct ggml_cgraph * llm_build_llama(
if (n_gpu_layers > n_layer + 2) {
offload_func_kq = ggml_cuda_assign_buffers_no_alloc;
}
if (n_gpu_layers > 0) {
offload_func_skip = ggml_cuda_assign_buffers_no_alloc;
}
#endif // GGML_USE_CUBLAS

struct ggml_tensor * KQ_scale = ggml_new_tensor_1d(ctx0, GGML_TYPE_F32, 1);
@@ -2284,18 +2289,10 @@ static struct ggml_cgraph * llm_build_llama(
offload_func_kq(tmpk);
ggml_set_name(tmpk, "tmpk");

struct ggml_tensor * tmpq = ggml_mul_mat(ctx0, model.layers[il].wq, cur);
offload_func_kq(tmpq);
ggml_set_name(tmpq, "tmpq");

struct ggml_tensor * Kcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpk, n_embd_head, n_head_kv, N), n_past, n_embd_head, 0, 0, freq_base, freq_scale);
offload_func_kq(Kcur);
ggml_set_name(Kcur, "Kcur");

struct ggml_tensor * Qcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpq, n_embd_head, n_head, N), n_past, n_embd_head, 0, 0, freq_base, freq_scale);
offload_func_kq(Qcur);
ggml_set_name(Qcur, "Qcur");

// store key and value to memory
{
// compute the transposed [N, n_embd] V matrix
@@ -2323,6 +2320,37 @@ static struct ggml_cgraph * llm_build_llama(
ggml_build_forward_expand(gf, ggml_cpy(ctx0, Vcur, v));
}

#ifdef LLAMA_SKIP_UNUSED_LOGITS
if (il == n_layer - 1 && !lctx.logits_all)
{
Collaborator

This does not follow the coding guidelines in the README.

Collaborator Author

Thanks, fixed.

// From here on, we only care about the last token and its logits.
// We do as if N = 1 (from the end), which means we only keep
// the last column of cur and inpSA ((n_embd, N) -> (n_embd, 1)).
//
// Note that we do this even when N==1 so that we don't change the # nodes in the graph,
// otherwise for Metal we'd have to rebuild the concurrency list.

cur = ggml_view_2d(ctx0, cur, n_embd, 1, cur->nb[1], (N - 1)*ggml_element_size(cur)*n_embd);
offload_func_skip(cur);
ggml_set_name(cur, "cur-lastpos");

inpSA = ggml_view_2d(ctx0, inpSA, n_embd, 1, inpSA->nb[1], (N - 1)*ggml_element_size(inpSA)*n_embd);
offload_func_skip(inpSA);
ggml_set_name(inpSA, "inpSA-lastpos");

n_past += N - 1;
N = 1;
}
#endif // LLAMA_SKIP_UNUSED_LOGITS

struct ggml_tensor * tmpq = ggml_mul_mat(ctx0, model.layers[il].wq, cur);
offload_func_kq(tmpq);
ggml_set_name(tmpq, "tmpq");

struct ggml_tensor * Qcur = ggml_rope_custom_inplace(ctx0, ggml_reshape_3d(ctx0, tmpq, n_embd_head, n_head, N), n_past, n_embd_head, 0, 0, freq_base, freq_scale);
offload_func_kq(Qcur);
ggml_set_name(Qcur, "Qcur");

struct ggml_tensor * Q = ggml_permute(ctx0, Qcur, 0, 2, 1, 3);
offload_func_kq(Q);
ggml_set_name(Q, "Q");
@@ -2936,11 +2964,18 @@ static bool llama_eval_internal(

if (lctx.logits_all) {
logits_out.resize(n_vocab * N);
GGML_ASSERT(ggml_nelements(res) == n_vocab * N);
memcpy(logits_out.data(), (float *) ggml_get_data(res), sizeof(float)*n_vocab*N);
} else {
// return result for just the last token
logits_out.resize(n_vocab);
#ifdef LLAMA_SKIP_UNUSED_LOGITS
GGML_ASSERT(ggml_nelements(res) == n_vocab);
memcpy(logits_out.data(), (float *) ggml_get_data(res), sizeof(float)*n_vocab);
#else
GGML_ASSERT(ggml_nelements(res) == n_vocab * N);
memcpy(logits_out.data(), (float *) ggml_get_data(res) + (n_vocab*(N-1)), sizeof(float)*n_vocab);
#endif
}
}
