FSDP broken on newer versions of pytorch #450

Open
yocontra opened this issue Apr 21, 2024 · 4 comments
Comments

@yocontra

The commit pytorch/pytorch@a832967, released in torch 2.1.0, breaks this code: https://github.com/facebookresearch/audiocraft/blob/main/audiocraft/optim/fsdp.py#L126

This ticket (#358) has an implementation that works on torch 2.1.0, but a final implementation should probably keep some backward compatibility with older torch versions.
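One possible shape for that backward compatibility is a version gate that picks between the existing code path and the #358 one at import time. The sketch below is only an illustration under assumptions: `parse_major_minor` and `select_fsdp_patch` are hypothetical helper names, and the two implementations are passed in as placeholders rather than taken from audiocraft or #358.

```python
import re
from typing import Callable, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '2.1.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def select_fsdp_patch(
    torch_version: str,
    legacy_impl: Callable,
    new_impl: Callable,
) -> Callable:
    """Pick an implementation based on the installed torch version.

    Hypothetical helper: `legacy_impl` stands in for the current code path
    (torch < 2.1.0) and `new_impl` for the one proposed in #358.
    """
    if parse_major_minor(torch_version) >= (2, 1):
        return new_impl
    return legacy_impl


# Intended usage (assumption, not audiocraft code):
#   import torch
#   patch = select_fsdp_patch(torch.__version__, old_patch, new_patch)
```

Comparing only the major and minor components keeps local build suffixes such as `+cu118` from affecting the check.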

@nateraw

nateraw commented Apr 21, 2024

I think it's worth mentioning that while this patch did work for me for medium models, deadlocks were still so prevalent when trying to train large models that I had to rewrite the trainer in my fork with PyTorch Lightning to make them go away.

So I think there may be some other issues at play here.

@yocontra
Author

@nateraw Hmm, interesting. I've done a couple of training runs on the large models (using dora) and didn't have any issues with deadlocks.

@nateraw

nateraw commented May 2, 2024

This may have been hardware-specific: I had issues on H100s, but not on A100s.

@yocontra
Author

yocontra commented May 3, 2024

I'm training on A40s with no issues, so it may be hardware-specific. I did see some other people post issues about H100s.
