
Is there a plan to support multi-node training? #5

Open
huybery opened this issue Jun 29, 2023 · 6 comments
Comments

@huybery

huybery commented Jun 29, 2023

I haven't found a good best practice for multi-node training with FSDP. Have you tried it? Thank you in advance. :)

@eric-mitchell
Owner

Multi-node training is something we're planning to start looking into very soon (in the next week). Unfortunately our cluster is down for maintenance for the next ~5 days, so we won't be able to do any development/testing before then. Trying to secure alternative compute, but unsure if anything will come through before our cluster is back.

If you have access to a multi-node cluster, I think you could try running our code across multiple nodes with relatively few modifications. The discussion in this issue might be a starting point for what needs to change when going from single-node to multi-node. Happy to discuss/debug if you have the time/compute to try it out yourself :)
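To make the single-node vs. multi-node difference concrete, here is a minimal sketch of the rank bookkeeping that usually has to change. It assumes a launcher that spawns one process per GPU and, on a single node, uses the local GPU index as the rank. The names `DistEnv`, `resolve_dist_env`, `NUM_NODES`, and `NODE_RANK` are hypothetical and not taken from this repository; `MASTER_ADDR` and `MASTER_PORT` follow the usual PyTorch convention. The resulting values are what one would typically pass to `torch.distributed.init_process_group` as `rank`, `world_size`, and `init_method`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DistEnv:
    rank: int         # global rank, unique across all nodes
    local_rank: int   # GPU index on this node
    world_size: int   # total number of processes across all nodes
    init_method: str  # rendezvous URL shared by every process


def resolve_dist_env(
    local_rank: int,
    gpus_per_node: int,
    env: Optional[Mapping[str, str]] = None,
) -> DistEnv:
    """Derive global rank and world size from per-node settings.

    NUM_NODES and NODE_RANK are hypothetical variable names chosen for
    this sketch; a real cluster scheduler may expose them differently.
    """
    env = os.environ if env is None else env
    num_nodes = int(env.get("NUM_NODES", "1"))
    node_rank = int(env.get("NODE_RANK", "0"))
    master_addr = env.get("MASTER_ADDR", "localhost")
    master_port = env.get("MASTER_PORT", "29500")

    if not 0 <= local_rank < gpus_per_node:
        raise ValueError(f"local_rank {local_rank} out of range")
    if not 0 <= node_rank < num_nodes:
        raise ValueError(f"node_rank {node_rank} out of range")

    return DistEnv(
        # With one node this reduces to rank == local_rank.
        rank=node_rank * gpus_per_node + local_rank,
        local_rank=local_rank,
        world_size=num_nodes * gpus_per_node,
        # Every node must point at the same reachable address, not localhost.
        init_method=f"tcp://{master_addr}:{master_port}",
    )
```

The two points this is meant to illustrate: the GPU a process selects should come from `local_rank`, while the rank used for the process group should be the global one, and the rendezvous address has to be reachable from every node.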

@huybery
Author

huybery commented Jun 29, 2023

Thanks for your quick response!
I'm working on modifying the code for multi-node training, but at the moment I'm running into some obstacles: the job hangs when run on multiple nodes. I'd be happy to help you debug the multi-node code together. Maybe you could develop a first version for me to debug? I'm worried about missing some key points, as I'm not familiar with FSDP.

@liumingzhu6060

Excuse me, is multi-node training almost ready?

@eric-mitchell
Owner

Sorry for the slow progress on this; the last few weeks have been much busier than expected. I don't have a clear timeline for multi-node at this point, unfortunately. I might be able to test some things this week, but with ICML prep I'm not 100% sure.

@AltenLi

AltenLi commented Aug 15, 2023

+1

@LMXKO

LMXKO commented Nov 28, 2023

+1

5 participants