Solve hanging load_model
and let LRFind be run in a distributed setup
#3689
Solve load_model hanging in distributed setup

What does this add?
This PR guts some of the distributed logic in load_model and replaces it with better Accelerate logic for loading the Learner.

Who is it for?
Closes #3688 and #3132
Why is it needed?
Currently we have a distrib_barrier inside of load_model, which one might assume is enough to avoid deadlocks, but it is not. All processes need to finish loading the model weights before moving on with whatever comes next; without that, calling learn.lr_find() in a distributed setup hangs in a deadlock, as in the sketch below.
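A minimal sketch of the call pattern that hung; the dataset, architecture, and launch command are illustrative assumptions, not part of this PR:

```python
# Hedged sketch of the failing pattern; dataset, architecture, and batch size are
# illustrative, not taken from this PR. Assumes a multi-GPU launch, e.g.
# `accelerate launch train.py`.
from fastai.vision.all import *
from fastai.distributed import *

path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_name_re(
    path, get_image_files(path/"images"), pat=r'(.+)_\d+.jpg$',
    item_tfms=Resize(224), bs=32)
learn = vision_learner(dls, resnet34, metrics=error_rate)

with learn.distrib_ctx():
    # lr_find saves the weights, runs the LR sweep, then restores them with
    # load_model; if some ranks proceed before every process has reloaded the
    # weights, the run deadlocks here.
    learn.lr_find()
```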
What parts of the API does this impact?
User-facing: Nothing.
Internal structure:
Since in distributed setups we assume that Accelerate will always be used, the Learner has a learn.accelerator attribute. We can then call Accelerator's wait_for_everyone method to ensure that all models are loaded before moving on, avoiding the deadlock; see the sketch below.
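A hedged sketch of that synchronization pattern, assuming fastcore's nested_attr and noop helpers (the exact placement inside load_model may differ from the actual diff):

```python
# Hedged sketch, not the exact diff: after the weights are loaded, wait for every
# process via the Learner's Accelerator, falling back to a no-op when there is no
# accelerator attribute (i.e. a non-distributed run).
from fastcore.basics import nested_attr, noop

def _sync_after_load(learn):
    # In a distributed setup prepared by Accelerate, learn.accelerator exists, so
    # this resolves to Accelerator.wait_for_everyone and blocks until every rank
    # reaches this point; otherwise nested_attr returns noop and the call does nothing.
    nested_attr(learn, 'accelerator.wait_for_everyone', noop)()
```

Because the fallback is noop, single-process code paths are unaffected by the extra call.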
When would I use it, and when wouldn't I?
When would you: in a distributed run.
When wouldn't you: when you're not running distributed (which is why it's a nested_attr + noop).
I've tested this on my own 2-GPU system and lr_find ran to completion without issue.
cc @jph00