
Fix mup for the layers with AttentionLayerMup #494

Merged
8 commits merged into main on Dec 24, 2023
Conversation

@DomInvivo (Collaborator) commented on Dec 20, 2023

Changelogs

  • Added embed_dim to the list of keys to look for when doing the mup kwargs

@maciej-sypetkowski I think this should fix your issue, although I can't verify it: on my end, the config with architecture.mup_scale_factor: 2 already works, and I can't reproduce the failure without knowing how you do your scaling. At the very least, the attn_layer keys in mup_base_params.yaml are no longer null.
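
For readers unfamiliar with the change, here is a minimal sketch of the kind of kwargs scaling involved; the function and key names are illustrative, not the repo's exact API:

```python
# Hypothetical sketch -- names are illustrative, not graphium's exact API.
# The gist of the fix: "embed_dim" is now among the width-like keys that get
# scaled when building the mup base shapes, so the attn_layer entries in
# mup_base_params.yaml are no longer null.
def scale_mup_kwargs(kwargs: dict, scale_factor: float) -> dict:
    width_keys = ["in_dim", "out_dim", "hidden_dim", "embed_dim"]  # "embed_dim" newly included
    scaled = dict(kwargs)
    for key in width_keys:
        if scaled.get(key) is not None:
            scaled[key] = round(scaled[key] * scale_factor)
    return scaled

# e.g. with architecture.mup_scale_factor: 2
print(scale_mup_kwargs({"embed_dim": 64, "num_heads": 4}, 2.0))
# {'embed_dim': 128, 'num_heads': 4}
```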

IMPORTANT

When this PR is merged, it will affect the reproducibility of models that use AttentionLayerMup, such as GPSLayerPyg, since mup will now affect the learning rate of these layers, whereas they were previously ignored.
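
For context on why this matters for optimization (a general muP property rather than anything specific to this PR; the exact scaling depends on the optimizer and on how the mup library is configured):

```python
# Rough illustration only, assuming an Adam-style optimizer and the standard
# muP rule for matrix-like weights; consult the mup library for exact behavior.
base_lr = 1e-3
width_mult = 2.0                       # e.g. architecture.mup_scale_factor
attn_weight_lr = base_lr / width_mult  # attention projections now get this scaled
                                       # lr instead of being skipped as before
print(attn_weight_lr)  # 0.0005
```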

codecov bot commented Dec 20, 2023

Codecov Report

Merging #494 (4045fcf) into main (8cbf2d0) will increase coverage by 0.17%.
Report is 23 commits behind head on main.
The diff coverage is 100.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #494      +/-   ##
==========================================
+ Coverage   71.35%   71.52%   +0.17%     
==========================================
  Files          94       93       -1     
  Lines        8718     8707      -11     
==========================================
+ Hits         6221     6228       +7     
+ Misses       2497     2479      -18     
Flag Coverage Δ
unittests 71.52% <100.00%> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown.

Components Coverage Δ
ipu 49.14% <ø> (ø)

Comment on lines +1330 to +1332
assert (
    x[k] % num_heads == 0
), f"embed_dim={x[k]} is not divisible by num_heads={num_heads}"
@DomInvivo (Collaborator, Author):
I don't think it's needed, since there's already another assertion in AttentionLayerMup. @maciej-sypetkowski, can you check whether it still works if we remove this part and scale by a factor that leaves embed_dim not divisible by num_heads?

Collaborator:
Yes, it's not needed
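
For illustration of the point above (this uses PyTorch's generic attention layer, not this repo's AttentionLayerMup, which is said to carry its own assertion):

```python
import torch.nn as nn

# The attention layer itself already enforces the divisibility constraint,
# so the extra assert in the diff above is redundant.
nn.MultiheadAttention(embed_dim=130, num_heads=4)
# AssertionError: embed_dim must be divisible by num_heads
```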

@maciej-sypetkowski (Collaborator) left a review comment:

LGTM. Tested and it works

env.yml Outdated
@@ -17,6 +17,7 @@ dependencies:
- pandas >=1.0
- scikit-learn
- fastparquet
- networkx
Collaborator:
Why is it needed now?

@DomInvivo merged commit f698df4 into main on Dec 24, 2023
7 checks passed
DomInvivo added a commit that referenced this pull request Dec 24, 2023
Removed double check of embed_dim/num_heads, discussed in PR #494
@DomInvivo mentioned this pull request on Dec 24, 2023