TLDR;
My theory is that spreading CPU use 'sparsely' across the entire forward pass, instead of concentrating the slowdown on the final layers, should be much faster for MoE, since the same end layers are not always used.
Explanation
The thing with traditional offloading for dense models is that it makes sense to have a strictly defined cutoff point where computation moves from GPU to CPU after a certain layer, because that minimizes CPU-to-GPU transfers (a dense model always uses the same end layers).
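To make the contrast concrete, here is a minimal C++ sketch (not actual llama.cpp code, just an illustration) of what the dense-model rule amounts to: the device choice depends only on a single cutoff index, so all CPU work lands at the end of every token's forward pass.

```cpp
#include <cstdint>

enum class Device { GPU, CPU };

// n_gpu_layers plays the role of the -ngl value here (illustrative only).
inline Device device_for_layer(int32_t layer_idx, int32_t n_gpu_layers) {
    // Everything before the cutoff runs on the GPU, everything after on the CPU,
    // so the CPU slowdown is always concentrated in the final layers.
    return layer_idx < n_gpu_layers ? Device::GPU : Device::CPU;
}
```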
With MoE, there's no guarantee that expert 7 will be used in the last layers of the forward pass, and in practice it often isn't.
A lot of the time, the last experts are used so rarely across all of a single token's layer computations that it would make sense to prioritize loading as many full experts as possible onto the GPU.
This, of course, doesn't hold universally, and in aggregate the share of experts is quite even. But because of autoregressive generation, longer sequences of tokens might be computed more quickly overall for a net speed gain, even if generation is occasionally slightly slower [which I doubt].
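As a rough illustration of what prioritizing full experts could look like, here is a hedged C++ sketch. It assumes hypothetical per-(layer, expert) usage counts, e.g. gathered from a short profiling run over a sample prompt; none of these types or helpers exist in llama.cpp. It just shows a greedy fill of the VRAM budget with the most-used expert tensors first.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ExpertSlot {
    int32_t  layer;         // transformer layer index
    int32_t  expert;        // expert index within that layer
    uint64_t hits;          // how often the router picked this expert
    size_t   bytes;         // size of this expert's weights
    bool     on_gpu = false;
};

// Greedily keep the most frequently used expert tensors resident on the GPU
// until the VRAM budget runs out; everything else stays on the CPU.
void assign_devices(std::vector<ExpertSlot> & slots, size_t vram_budget) {
    std::sort(slots.begin(), slots.end(),
              [](const ExpertSlot & a, const ExpertSlot & b) { return a.hits > b.hits; });
    size_t used = 0;
    for (auto & s : slots) {
        if (used + s.bytes <= vram_budget) {
            s.on_gpu = true;
            used += s.bytes;
        }
    }
}
```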
But even if this isn't the case, and you get the same performance either way (for 1:1 VRAM use comparisons), you would still have finer control over how many actual layers to offload, and therefore this should in theory give you better overall VRAM utilization (instead of having to load exactly 20 groups of 8 layers, you could specify the actual amount that fits in VRAM). For example, at the moment, -ngl 12 is equivalent to 96 actual layers for Mixtral, as it specifies that the first 12 layers of all 8 experts are offloaded.
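A small back-of-the-envelope sketch of the granularity point (illustrative C++ only; the fine-grained count is a hypothetical option, not an existing flag):

```cpp
#include <cstdio>

int main() {
    const int n_expert = 8;   // Mixtral 8x7B: 8 experts per layer
    const int ngl      = 12;  // whole layers offloaded today

    // Current granularity: offloading always moves n_expert tensors per layer.
    printf("whole-layer offload: %d expert tensors\n", ngl * n_expert);

    // Hypothetical finer granularity: pick any count, e.g. 100, so the last,
    // partially covered layer keeps only 4 of its 8 experts on the GPU.
    const int n_expert_tensors = 100;
    printf("fine-grained offload: %d full layers + %d experts of the next layer\n",
           n_expert_tensors / n_expert, n_expert_tensors % n_expert);
    return 0;
}
```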
But I think it's more likely that in practice you would only occasionally hit a CPU bottleneck (for individual experts that are not fully offloaded), rather than hitting it constantly across all experts at the point in the forward pass where computation switches from GPU to CPU.
If you could specify the actual number of layers, the CPU bottleneck would be more 'evenly distributed' based on actual expert use rather than being a constant slowdown after a certain point in the forward pass. And you'd probably end up communicating with the CPU less often in general outside of massive batches, but that part is unproven.
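To show how the bottleneck gets spread out, here is one last hedged C++ sketch; the Tensor type and run_expert_gpu / run_expert_cpu are placeholders I made up, not llama.cpp APIs. The point is only that the GPU/CPU decision happens per (layer, expert) at the moment the router selects an expert, so a token only pays the CPU cost for the specific non-resident experts it actually hits.

```cpp
#include <vector>

struct Tensor {};  // stand-in for an activation buffer (illustrative)

// Placeholder stubs standing in for running one expert's FFN on each device.
static Tensor * run_expert_gpu(int /*layer*/, int /*expert*/, Tensor * x) { return x; }
static Tensor * run_expert_cpu(int /*layer*/, int /*expert*/, Tensor * x) { return x; }

// expert_on_gpu[layer][expert] would come from something like assign_devices()
// in the earlier sketch.
Tensor * moe_ffn(int layer, const std::vector<int> & selected_experts,
                 const std::vector<std::vector<bool>> & expert_on_gpu, Tensor * x) {
    std::vector<Tensor *> expert_out;
    for (int e : selected_experts) {
        // Only the occasional non-resident expert falls back to the CPU, so the
        // slowdown is scattered across the forward pass instead of being
        // concentrated after a fixed layer cutoff.
        expert_out.push_back(expert_on_gpu[layer][e] ? run_expert_gpu(layer, e, x)
                                                     : run_expert_cpu(layer, e, x));
    }
    // The router-weighted combination of expert_out is elided; the point is only
    // that the device decision is per (layer, expert), not per layer range.
    return expert_out.empty() ? x : expert_out.front();
}
```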