
Failure to reproduce MLA > MHA #23

Open
faresobeid opened this issue May 12, 2024 · 5 comments

Comments

@faresobeid

I tried out MLA and it performed noticeably worse than MHA, and I wanted to find out why. First, I am using a hybrid model, so I am not using any RoPE in either MLA or MHA, and therefore use the basic version of MLA. I suspect the issue could be due to the part of the paper that says:

"In addition, the low-rank compression and fine-grained expert segmentation will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm layers after the compressed latent vectors, and multiply additional scaling factors at the width bottlenecks (i.e., the compressed latent vectors and the intermediate hidden states of routed experts) to ensure stable training."

It is unclear whether the additional scaling factor is applied before or after the RMSNorm, and also what this factor would be. Another possibility is that the RoPE version of MLA gives it a performance boost.
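For concreteness, here is a minimal PyTorch sketch of the two placements I mean. The names (W_a, W_b, latent_scale) and the dimensions are placeholders of my own, not values from the paper or the released checkpoint:

```python
# Minimal sketch of the two candidate placements of the scaling factor
# relative to the RMSNorm on the compressed latent. All names and sizes
# here are placeholders, not taken from the paper or checkpoint.
import torch
import torch.nn as nn

d_model, d_latent = 2048, 512
W_a = nn.Linear(d_model, d_latent, bias=False)   # down-projection to the compressed latent
W_b = nn.Linear(d_latent, d_model, bias=False)   # up-projection back out of the latent
rms_norm = nn.RMSNorm(d_latent)                  # requires PyTorch >= 2.4
latent_scale = 1.0                               # the unclear scaling factor

x = torch.randn(2, 8, d_model)

# Placement A: scale before the RMSNorm. Note that RMSNorm is invariant to a
# scalar rescaling of its input, so in this placement the factor has no effect.
kv_a = W_b(rms_norm(latent_scale * W_a(x)))

# Placement B: scale after the RMSNorm, i.e. on the normalized latent.
kv_b = W_b(latent_scale * rms_norm(W_a(x)))
```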

Any clarification on this scaling factor and its placement would be great, thanks!

@luofuli
Member

luofuli commented May 14, 2024

The following are factors that affect the final result:

  1. Rope positional embedding
  2. Scaling factors (you can check the open-source checkpoint)

@faresobeid
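
One way to follow the pointer to the open-source checkpoint is to dump its attention-related config fields. A minimal sketch, assuming the public deepseek-ai/DeepSeek-V2 Hugging Face repo id and transformers with remote code enabled (adjust to whatever checkpoint you actually use):

```python
# Sketch: list config fields of the released checkpoint that relate to MLA
# dimensions, normalization, and scaling. The repo id and the field-name
# filter below are assumptions, not guaranteed field names.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V2", trust_remote_code=True)

for name, value in sorted(vars(config).items()):
    if any(key in name.lower() for key in ("lora", "rank", "rope", "scal", "norm", "head")):
        print(f"{name} = {value}")
```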

@faresobeid
Author

Thank you!

faresobeid reopened this May 23, 2024
@faresobeid
Author

Sorry to reopen this issue, but I have been having some stability issues at scale with MLA. As I said before, I am using a hybrid model, so the MLA I am using is the basic version with no RoPE:

kv = W_b(rms_norm(W_a(x)))
I have also tried having

kv = W_b(rms_norm(W_a(ln(x))))
but that also has some issues with performance and stability. To be more specific, this is a 24-layer model with model dimension 2048, with the last third of the layers using MLA (~300M params). Are there any recommended scaling factors or other ways to mitigate this issue? Thank you
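
For reference, this is roughly what my no-RoPE MLA KV path looks like as a module. It is a simplified sketch; the latent dimension, head count, and module names are placeholders rather than my exact config or anything from the paper:

```python
# Simplified sketch of the no-RoPE ("basic") MLA key/value path described
# above: kv = W_b(rms_norm(W_a(x))). Dimensions are placeholders.
import torch
import torch.nn as nn

class BasicMLAKV(nn.Module):
    def __init__(self, d_model=2048, d_latent=512, n_heads=16, d_head=128):
        super().__init__()
        self.kv_a = nn.Linear(d_model, d_latent, bias=False)              # W_a: compress x to the latent
        self.kv_norm = nn.RMSNorm(d_latent)                               # inner RMSNorm on the latent (PyTorch >= 2.4)
        self.kv_b = nn.Linear(d_latent, 2 * n_heads * d_head, bias=False) # W_b: expand latent to K and V
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, x):
        b, s, _ = x.shape
        latent = self.kv_norm(self.kv_a(x))        # compressed, normalized latent
        k, v = self.kv_b(latent).chunk(2, dim=-1)  # joint K/V up-projection, then split
        k = k.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        return k, v                                # (batch, heads, seq, head_dim)

k, v = BasicMLAKV()(torch.randn(2, 8, 2048))
```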

@luofuli
Member

luofuli commented May 27, 2024

24-layer dense model? @faresobeid

@faresobeid
Author

Yes, although stability has been fine without the inner RMSNorm; still, any recommendations would be helpful.
