
Failure to reproduce MLA > MHA #23

Open
faresobeid opened this issue May 12, 2024 · 5 comments

Comments

@faresobeid

I tried out MLA and it performed noticeably worse than MHA, and I wanted to find out why. First, I am using a hybrid model, so I am not using any RoPE in either MLA or MHA, and therefore use the basic version of MLA. I suspect the issue could be due to the part of the paper that says:

"In addition, the low-rank compression and fine-grained expert segmentation will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm layers after the compressed latent vectors, and multiply additional scaling factors at the width bottlenecks (i.e., the compressed latent vectors and the intermediate hidden states of routed experts) to ensure stable training."

It is unclear whether the additional scaling factor is applied before or after the RMSNorm, and also what this factor would be. Another possibility is that the RoPE version of MLA gives it a performance boost.
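For concreteness, here is a minimal PyTorch sketch of the two placements I mean. The names (W_a, W_b, latent_scale) and the dimensions are placeholders of my own, not values from the paper or the released checkpoint:

```python
# Minimal sketch of the two candidate placements of the scaling factor
# relative to the RMSNorm on the compressed latent. All names and sizes
# here are placeholders, not taken from the paper or checkpoint.
import torch
import torch.nn as nn

d_model, d_latent = 2048, 512
W_a = nn.Linear(d_model, d_latent, bias=False)   # down-projection to the compressed latent
W_b = nn.Linear(d_latent, d_model, bias=False)   # up-projection back out of the latent
rms_norm = nn.RMSNorm(d_latent)                  # requires PyTorch >= 2.4
latent_scale = 1.0                               # the unclear scaling factor

x = torch.randn(2, 8, d_model)

# Placement A: scale before the RMSNorm. Note that RMSNorm is invariant to a
# scalar rescaling of its input, so in this placement the factor has no effect.
kv_a = W_b(rms_norm(latent_scale * W_a(x)))

# Placement B: scale after the RMSNorm, i.e. on the normalized latent.
kv_b = W_b(latent_scale * rms_norm(W_a(x)))
```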

Any clarification on this scaling factor and its placement would be great, thanks!

@luofuli
Member

luofuli commented May 14, 2024

The following are factors that affect the final result:

  1. Rope positional embedding
  2. Scaling factors (you can check the open-source checkpoint)

@faresobeid
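
One way to follow the pointer to the open-source checkpoint is to dump its attention-related config fields. A minimal sketch, assuming the public deepseek-ai/DeepSeek-V2 Hugging Face repo id and transformers with remote code enabled (adjust to whatever checkpoint you actually use):

```python
# Sketch: list config fields of the released checkpoint that relate to MLA
# dimensions, normalization, and scaling. The repo id and the field-name
# filter below are assumptions, not guaranteed field names.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V2", trust_remote_code=True)

for name, value in sorted(vars(config).items()):
    if any(key in name.lower() for key in ("lora", "rank", "rope", "scal", "norm", "head")):
        print(f"{name} = {value}")
```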

@faresobeid
Author

Thank you!

faresobeid reopened this May 23, 2024
@faresobeid
Author

Sorry to reopen this issue, but I have been having some stability issues at scale with MLA. As I said before, I am using a hybrid model, so the MLA I am using is the basic version with no RoPE:

kv = W_b(rms_norm(W_a(x)))
I have also tried having

kv = W_b(rms_norm(W_a(ln(x))))
but that also has some issues with performance and stability. To be more specific, this is a 24-layer model with model dimension 2048, with the last third of the layers using MLA (~300M params). Are there any recommended scaling factors or other ways to mitigate this issue? Thank you
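
For reference, this is roughly what my no-RoPE MLA KV path looks like as a module. It is a simplified sketch; the latent dimension, head count, and module names are placeholders rather than my exact config or anything from the paper:

```python
# Simplified sketch of the no-RoPE ("basic") MLA key/value path described
# above: kv = W_b(rms_norm(W_a(x))). Dimensions are placeholders.
import torch
import torch.nn as nn

class BasicMLAKV(nn.Module):
    def __init__(self, d_model=2048, d_latent=512, n_heads=16, d_head=128):
        super().__init__()
        self.kv_a = nn.Linear(d_model, d_latent, bias=False)              # W_a: compress x to the latent
        self.kv_norm = nn.RMSNorm(d_latent)                               # inner RMSNorm on the latent (PyTorch >= 2.4)
        self.kv_b = nn.Linear(d_latent, 2 * n_heads * d_head, bias=False) # W_b: expand latent to K and V
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, x):
        b, s, _ = x.shape
        latent = self.kv_norm(self.kv_a(x))        # compressed, normalized latent
        k, v = self.kv_b(latent).chunk(2, dim=-1)  # joint K/V up-projection, then split
        k = k.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, s, self.n_heads, self.d_head).transpose(1, 2)
        return k, v                                # (batch, heads, seq, head_dim)

k, v = BasicMLAKV()(torch.randn(2, 8, 2048))
```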

@luofuli
Member

luofuli commented May 27, 2024

24-layer dense model? @faresobeid

@faresobeid
Author

Yes, although stability has been fine without the inner RMSNorm; still, any recommendations would be helpful.
