Is there a plan to support multi-node training? #5
Comments
Multi-node training is something we're planning to start looking into very soon (in the next week). Unfortunately, our cluster is down for maintenance for the next ~5 days, so we won't be able to do any development or testing before then. We're trying to secure alternative compute, but it's unclear whether anything will come through before our cluster is back. If you have access to a multi-node cluster, I think you could try running our code multi-node with relatively few modifications. The discussion in this issue might be a starting point for what needs to change when going from single-node to multi-node. Happy to discuss/debug if you have the time/compute to try it out yourself :)
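For anyone who wants to try this, here is a rough sketch of the rank bookkeeping that typically changes when moving from one node to several. It is not this project's code. It assumes a torchrun-style launcher that exports `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`, and the names `DistInfo`, `read_dist_info`, and `global_rank` are made up for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DistInfo:
    rank: int        # global rank across all nodes
    local_rank: int  # index of this process on its own node (usually picks the GPU)
    world_size: int  # total number of processes across all nodes


def read_dist_info(env=None) -> DistInfo:
    """Read torchrun-style variables, falling back to single-process defaults.

    Assumption: the launcher sets RANK, LOCAL_RANK and WORLD_SIZE.
    """
    env = os.environ if env is None else env
    return DistInfo(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


def global_rank(node_rank: int, gpus_per_node: int, local_rank: int) -> int:
    """Global rank when every node runs the same number of processes."""
    return node_rank * gpus_per_node + local_rank
```

On a single node the global and local ranks coincide, so code that uses one value for both device selection and "am I the main process" checks tends to be the first thing that needs splitting: device selection should follow `local_rank`, while logging and checkpoint ownership should follow `rank`.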
Thanks for your quick response!
Excuse me, is multi-node training almost ready?
Sorry for the slow progress on this; the last few weeks have been much busier than expected. I don't have a clear timeline for multi-node at this point, unfortunately. I might be able to test some things this week, but with ICML prep I'm not 100% sure.
+1
+1
I haven't found a good best practice for multi-node training with FSDP. Have you tried it? Thank you in advance. :)