
llama : combine expert tensors into a single tensor #6082

Closed
ggerganov opened this issue Mar 15, 2024 · 1 comment · Fixed by #6387
Labels: high priority · refactoring

Comments

@ggerganov (Owner)

Currently, we store separate tensors for each expert:

llama.cpp/ggml.c, lines 4442 to 4455 in 3020327:

result->op   = GGML_OP_MUL_MAT_ID;
result->grad = is_node ? ggml_dup_tensor(ctx, result) : NULL;
result->src[0] = ids;
result->src[1] = b;

for (int i = 0; i < n_as; i++) {
    struct ggml_tensor * a = as[i];
    GGML_ASSERT(ggml_are_same_shape(as[0], a));
    GGML_ASSERT(ggml_can_mul_mat(a, b));
    GGML_ASSERT(!ggml_is_transposed(a));
    result->src[i + 2] = a;
}

This leads to a large number of possible "source" tensors for the _id ops, which significantly increases the size of struct ggml_tensor on the stack:

llama.cpp/ggml.h, lines 573 to 576 in 3020327:

struct ggml_tensor * grad;
struct ggml_tensor * src[GGML_MAX_SRC];
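
To make the cost concrete, here is a back-of-the-envelope sketch; the slot counts below are assumptions for illustration, not the actual GGML_MAX_SRC values:

/* hypothetical slot counts: supporting n_as experts requires 2 + n_as source
 * slots, so GGML_MAX_SRC must grow with the maximum expert count, and every
 * tensor pays for the unused slots */
#include <stdio.h>

#define MAX_SRC_BASE 6        // enough for ordinary ops (assumed value)
#define MAX_SRC_MOE  (2 + 8)  // ids + b + 8 expert tensors

int main(void) {
    // each slot is one pointer, i.e. 8 bytes on a typical 64-bit target
    printf("extra bytes per ggml_tensor: %zu\n",
           (MAX_SRC_MOE - MAX_SRC_BASE) * sizeof(void *));
    return 0;
}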

Additionally, the Metal implementation is currently hacked to support up to 8 experts, and extending it beyond that is not completely obvious:

llama.cpp/ggml-metal.m, lines 1750 to 1759 in 3020327:

// TODO: how to make this an array? read Metal docs
for (int j = 0; j < 8; ++j) {
    // NOTE: this is done like this to avoid uninitialized kernel arguments when n_as < 8
    struct ggml_tensor * src_cur = dst->src[2 + (j % n_as)];

    size_t offs_src_cur = 0;
    id<MTLBuffer> id_src_cur = ggml_metal_get_buffer(src_cur, &offs_src_cur);

    [encoder setBuffer:id_src_cur offset:offs_src_cur atIndex:19 + j];
}

We should improve this. One possible approach is to store the data for all experts in a single tensor and address it with appropriate offsets.
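
Below is a minimal sketch of the offset-based addressing, assuming the experts are stacked along the third dimension of one fused tensor with shape [ne00, ne01, n_expert]; the helper name is hypothetical, not an existing ggml API:

#include "ggml.h"

// hypothetical helper: locate one expert's weights inside the fused tensor
static void * expert_data(const struct ggml_tensor * as, int expert_id) {
    // nb[2] is the stride in bytes between consecutive experts, so each
    // expert's 2D weight matrix starts at a fixed offset from as->data
    return (char *) as->data + (size_t) expert_id*as->nb[2];
}

With such a layout, MUL_MAT_ID would need only three sources (the fused tensor, b and ids) regardless of the expert count, and a backend like Metal could bind a single buffer and compute the per-expert offset inside the kernel.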

@slaren (Collaborator) commented Mar 27, 2024

I am not sure if we can implement this change while maintaining compatibility with existing models without breaking mmap, since we would need to modify the layout of the tensors. I think that maintaining backwards compatibility with models with split experts is important; we should not ask people to re-download 50GB models. However, we may have to disable mmap for old models.
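
For illustration, here is a rough sketch (the function and its signature are hypothetical) of what loading an old split-experts model into a fused layout would entail; the copy into a fresh allocation is exactly what breaks mmap, since the fused tensor no longer points into the file mapping:

#include <stdlib.h>
#include <string.h>

// hypothetical loader step: fuse n_as equally-sized per-expert blobs into one
// contiguous allocation
static void * fuse_experts(void * const * split, int n_as, size_t bytes_per_expert) {
    char * fused = malloc((size_t) n_as * bytes_per_expert); // new memory, not the mapping
    if (fused == NULL) {
        return NULL;
    }
    for (int i = 0; i < n_as; ++i) {
        memcpy(fused + (size_t) i * bytes_per_expert, split[i], bytes_per_expert);
    }
    return fused;
}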
