Failure to reproduce MLA > MHA #23
The following are factors that affect the final result:
Thank you!
Sorry to reopen this issue, but I have been having some stability issues at scale with MLA. As I said before, I am using a hybrid model, so the MLA I am using is the basic version with no RoPE.
24-layer dense model? @faresobeid
Yes, although stability has been fine without the inner RMSNorm. Still, any recommendations would be helpful.
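For reference, by the inner RMSNorm I mean the norm sitting on the compressed KV latent, between the down-projection and the up-projection. A minimal sketch with placeholder dimensions (not the reference implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Plain RMSNorm, applied here to the compressed latent vector."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

# Placeholder dimensions: d_model = 2048, KV latent dim = 512.
w_dkv = nn.Linear(2048, 512, bias=False)   # down-projection to the KV latent
kv_norm = RMSNorm(512)                     # the "inner" norm in question
w_ukv = nn.Linear(512, 2048, bias=False)   # up-projection back toward keys/values

h = torch.randn(1, 16, 2048)               # (batch, seq, d_model)
c_kv = kv_norm(w_dkv(h))                   # normalized compressed latent
kv = w_ukv(c_kv)
```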
I tried out MLA and it was a good amount worse than MHA, and I wanted to find out why. Firstly, I am using a hybrid model, so I am not using any RoPE in either MLA or MHA, and therefore use the basic version of MLA. I suspect the issue could be due to the part of the paper that says:
"In addition, the low-rank compression and fine-grained expert segmentation
will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm
layers after the compressed latent vectors, and multiply additional scaling factors at the width
bottlenecks (i.e., the compressed latent vectors and the intermediate hidden states of routed experts) to ensure stable training."
It is unclear whether the additional scaling factor is applied before or after the RMSNorm, and also what this factor would be. Another possible reason is that the RoPE version of MLA gives it a performance boost.
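To make the question concrete, here is roughly how I am reading that passage for the no-RoPE variant. The `latent_scale` factor, its value, and its position relative to the norm (the commented option A vs. option B) are exactly the assumptions I would like clarified; names and dimensions are placeholders, not the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class BasicMLA(nn.Module):
    """No-RoPE MLA sketch: compressed Q/KV latents with RMSNorm after each down-projection."""
    def __init__(self, d_model=2048, n_heads=16, d_head=128,
                 q_latent=1536, kv_latent=512, latent_scale=1.0):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_head
        self.latent_scale = latent_scale                      # hypothetical extra scaling factor
        self.w_dq = nn.Linear(d_model, q_latent, bias=False)  # down-projections to latents
        self.w_dkv = nn.Linear(d_model, kv_latent, bias=False)
        self.q_norm = RMSNorm(q_latent)                       # norms on the compressed latents
        self.kv_norm = RMSNorm(kv_latent)
        self.w_uq = nn.Linear(q_latent, n_heads * d_head, bias=False)   # up-projections
        self.w_uk = nn.Linear(kv_latent, n_heads * d_head, bias=False)
        self.w_uv = nn.Linear(kv_latent, n_heads * d_head, bias=False)
        self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        # Option A: scale the latent before the norm, e.g.
        #   c_kv = self.kv_norm(self.latent_scale * self.w_dkv(x))
        # (a plain scalar would be undone by the RMS normalization).
        # Option B (what I currently do): scale the normalized latent.
        c_q = self.latent_scale * self.q_norm(self.w_dq(x))
        c_kv = self.latent_scale * self.kv_norm(self.w_dkv(x))
        q = self.w_uq(c_q).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_uk(c_kv).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_uv(c_kv).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.w_o(out.transpose(1, 2).reshape(b, t, -1))
```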
Any clarification on this scaling factor and its placement would be great, thanks!