
[FEATURE REQUEST] - Make MoE offloading offload real layers rather than groups of layers #4667

@kalomaze

Description


TL;DR

My theory is that spreading the CPU work 'sparsely' across the entire forward pass, rather than concentrating the slowdown in the final layers, should be much faster for MoE, since you are not guaranteed to use the same experts in those end layers every time.

Explanation

With traditional offloading for dense models, it makes sense to have a strictly defined cutoff point where execution moves from GPU to CPU after a certain layer, because that way you minimize CPU-to-GPU transfer (on a dense model you always have to run the same end layers).

With MoE, there's no guarantee that expert 7 will be used in the last layers of the forward pass, and it seems that it often isn't.
A lot of the time, the last experts are used so rarely throughout all layer computations (for a single token) that it would make sense to prioritize loading as many full experts as possible:

[Image: per-expert usage counts across the layers of a single token's forward pass]

This, of course, doesn't hold universally, and in aggregate the share of expert usage is quite even. But because of autoregressive generation, longer sequences of tokens might be computed more quickly overall for a net speed gain, even if individual tokens are occasionally slightly slower [which I doubt].
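
A minimal sketch of that intuition (not llama.cpp code; the layer/expert counts, the CPU weight budget, and the Zipf-like routing skew are all assumptions for illustration): with the same number of expert FFN blocks held on the CPU, a layer-cutoff split pays the CPU cost on every token, while a per-expert placement that keeps the rarely-routed experts on the CPU only pays it when those experts are actually chosen.

```cpp
// Toy simulation, not llama.cpp code. Counts expected CPU expert evaluations
// per token for (a) a layer-cutoff split and (b) per-expert placement that
// keeps the least-routed experts on the CPU. The Zipf-like routing skew is an
// assumption standing in for per-token usage counts like those in the screenshot.
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int n_layers = 32, n_experts = 8, top_k = 2;   // Mixtral-8x7B-like shape
    const int cpu_expert_blocks = 64;                    // same CPU weight budget in both schemes
                                                         // (= 8 full layers' worth of experts)

    // Assumed routing skew: expert i is picked with weight 1/(i+1).
    // Real statistics would have to come from instrumenting the router.
    std::vector<double> w(n_experts);
    for (int i = 0; i < n_experts; ++i) w[i] = 1.0 / (i + 1);
    std::discrete_distribution<int> route(w.begin(), w.end());

    std::mt19937 rng(0);
    const int n_tokens = 10000;
    long long cpu_hits_cutoff = 0, cpu_hits_per_expert = 0;

    for (int t = 0; t < n_tokens; ++t) {
        for (int layer = 0; layer < n_layers; ++layer) {
            for (int k = 0; k < top_k; ++k) {            // top-k drawn with replacement, for simplicity
                const int expert = route(rng);
                // (a) layer cutoff: the last 8 layers live entirely on the CPU,
                //     so every expert routed there is a CPU evaluation.
                if (layer >= n_layers - cpu_expert_blocks / n_experts)
                    ++cpu_hits_cutoff;
                // (b) per-expert placement: the 2 least-routed experts of every
                //     layer live on the CPU (2 * 32 = 64 blocks, same budget).
                if (expert >= n_experts - cpu_expert_blocks / n_layers)
                    ++cpu_hits_per_expert;
            }
        }
    }
    printf("CPU expert evals per token, layer cutoff: %.2f\n", (double)cpu_hits_cutoff / n_tokens);
    printf("CPU expert evals per token, per expert  : %.2f\n", (double)cpu_hits_per_expert / n_tokens);
}
```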

But even if this isn't the case, and you get the same performance either way (for a 1:1 VRAM-use comparison), you would still have finer control over how many actual layers to offload, and this should in theory give you better overall VRAM utilization (instead of having to load exactly 20 groups of 8 layers, you could specify the actual number that fits in VRAM). For example, at the moment, -ngl 12 is equivalent to 96 actual layers for Mixtral, as it specifies that the first 12 layers of all 8 experts are offloaded.
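
A quick sketch of that granularity point (the block size and VRAM budget below are made-up numbers, not measurements): when offloading can only happen in whole-layer chunks of 8 experts, part of the VRAM budget is stranded, whereas per-expert placement can fill it almost exactly.

```cpp
// Toy granularity comparison; the block size and VRAM budget are illustrative
// assumptions, not measured values.
#include <cstdio>

int main() {
    const double expert_ffn_gib    = 0.33;   // assumed size of one quantized expert FFN block
    const int    experts_per_layer = 8;      // Mixtral 8x7B
    const double layer_chunk_gib   = expert_ffn_gib * experts_per_layer;
    const double vram_budget_gib   = 10.0;   // hypothetical free VRAM for expert weights

    const int whole_layers  = (int)(vram_budget_gib / layer_chunk_gib);
    const int expert_blocks = (int)(vram_budget_gib / expert_ffn_gib);

    printf("whole-layer granularity: %d layers = %d expert blocks, %.2f GiB used\n",
           whole_layers, whole_layers * experts_per_layer, whole_layers * layer_chunk_gib);
    printf("per-expert granularity : %d expert blocks, %.2f GiB used\n",
           expert_blocks, expert_blocks * expert_ffn_gib);
}
```

With these made-up numbers, whole-layer offloading fits 24 expert blocks and leaves roughly 2 GiB unused, while per-expert placement fits 30 blocks in the same budget.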

But I think it's more likely that, in practice, you would only occasionally hit a CPU bottleneck (for individual experts that are not fully offloaded), rather than hitting it constantly across all experts at the point in the forward pass where you stop computing on the GPU and start computing on the CPU.

If you could specify the actual number of layers to offload, the CPU bottleneck would be more 'evenly distributed' based on actual expert use, rather than being a constant slowdown after a certain point in the forward pass. And you'd probably end up communicating with the CPU less often in general outside of massive batches, though that part is unproven.
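
One way this could look, purely as a hypothetical sketch (no such API or flag exists in llama.cpp today): a (layer, expert) → device map filled greedily up to a per-expert-block budget, instead of a single cutoff layer.

```cpp
// Hypothetical sketch only; nothing like this exists in llama.cpp today.
// Builds a (layer, expert) -> device map by placing as many expert FFN blocks
// on the GPU as fit, in forward-pass order, and leaving the rest on the CPU.
#include <cstdio>
#include <vector>

enum class Device { GPU, CPU };

int main() {
    const int n_layers = 32, n_experts = 8;
    const int gpu_expert_budget = 150;   // hypothetical: expert blocks that fit in free VRAM

    std::vector<std::vector<Device>> placement(
        n_layers, std::vector<Device>(n_experts, Device::CPU));

    int placed = 0;
    for (int layer = 0; layer < n_layers && placed < gpu_expert_budget; ++layer)
        for (int expert = 0; expert < n_experts && placed < gpu_expert_budget; ++expert, ++placed)
            placement[layer][expert] = Device::GPU;

    // A cutoff-style -ngl would have to round this budget down to 150 / 8 = 18
    // whole layers (144 blocks); here all 150 blocks land on the GPU.
    printf("expert blocks on GPU: %d of %d\n", placed, n_layers * n_experts);
}
```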
