
Solve hanging load_model and let LRFind be run in a distributed setup #3689

Merged
merged 1 commit into fastai:master on Jun 19, 2022

Conversation

@muellerzr (Contributor) commented Jun 15, 2022

Solve load_model hanging in distributed setup

What does this add?

This PR guts some of the distributed logic in load_model and replaces it with better Accelerate-based logic for loading the Learner.

Who is it for?

Closes #3688 and #3132

Why is it needed?

Currently we have a distrib_barrier inside load_model, which one would expect to be enough to avoid deadlocks, but it isn't. We actually need every process to finish loading the model weights before moving on with whatever we are about to do; without that, calling learn.lr_find() in a distributed run hangs in a deadlock.

What parts of the API does this impact?

User-facing:

Nothing

Internal structure:

Since we assume Accelerate will always be used in distributed setups, the Learner has a learn.accelerator attribute. We can then call the Accelerator's wait_for_everyone method to ensure that every process has finished loading the model before moving on, avoiding the deadlock.
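A minimal sketch of that synchronization (not the exact PR diff), assuming fastcore's nested_attr and noop helpers, which the PR description mentions as the fallback mechanism:

```python
# Sketch only: synchronize after loading weights, with a no-op fallback when
# there is no distributed run (and thus no learn.accelerator attribute).
from fastcore.basics import nested_attr, noop

def _sync_after_weight_load(learn):
    # In a distributed run learn.accelerator is an accelerate.Accelerator;
    # otherwise nested_attr returns noop, so this is harmless single-process.
    nested_attr(learn, 'accelerator.wait_for_everyone', noop)()
```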

When would I use it, and when wouldn't I?

When would you: A distributed run
When wouldn't you: When you're not running distributed (which is why it's a nested_attr + noop fallback)

I've tested this on my own 2-GPU system and lr_find ran to completion without issue.
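For context, a hypothetical script illustrating the scenario this PR fixes (the script contents are illustrative, not from the PR):

```python
# Hypothetical train.py: run learn.lr_find() in a distributed setup.
# Launch with e.g. `accelerate launch --num_processes 2 train.py`.
from fastai.vision.all import *
from fastai.distributed import *

def is_cat(f): return f[0].isupper()  # Pets filenames: cat breeds are capitalized

path = untar_data(URLs.PETS)/'images'
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2,
    label_func=is_cat, item_tfms=Resize(224))
learn = vision_learner(dls, resnet34, metrics=error_rate)

with learn.distrib_ctx():  # sets up the distributed (Accelerate-backed) context
    learn.lr_find()        # previously deadlocked when reloading weights (#3688)
```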

cc @jph00

@muellerzr muellerzr requested a review from jph00 as a code owner June 15, 2022 16:21
@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@muellerzr muellerzr changed the title from "Solves hanging load_model" to "Solve hanging load_model and let LRFind be run in a distributed setup" on Jun 15, 2022
@jph00 jph00 added the bug label Jun 19, 2022
@jph00 jph00 merged commit fbf689f into fastai:master Jun 19, 2022
Linked issue closed by this PR: learn.lr_find does not work in distributed setup (#3688)