
Solve hanging load_model and let LRFind be run in a distributed setup #3689

Merged
merged 1 commit into fastai:master on Jun 19, 2022

Conversation

@muellerzr (Contributor) commented Jun 15, 2022

Solve load_model hanging in distributed setup

What does this add?

This PR guts some of the distributed logic in load_model and replaces it with better Accelerate-based logic for loading the Learner.

Who is it for?

Closes #3688 and #3132

Why is it needed?

Currently we have a distrib_barrier inside load_model, which one would expect to be enough to avoid deadlocks, but it isn't. We actually need every process to finish loading the model weights before moving on with whatever we are about to do; without that, calling learn.lr_find() in a distributed run hangs in a deadlock.

What parts of the API does this impact?

User-facing:

Nothing

Internal structure:

Since we assume Accelerate will always be used in distributed setups, the Learner has a learn.accelerator attribute. We can then call the Accelerator's wait_for_everyone method to ensure that every process has finished loading the model before moving on, avoiding the deadlock.
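A minimal sketch of that synchronization (not the exact PR diff), assuming fastcore's nested_attr and noop helpers, which the PR description mentions as the fallback mechanism:

```python
# Sketch only: synchronize after loading weights, with a no-op fallback when
# there is no distributed run (and thus no learn.accelerator attribute).
from fastcore.basics import nested_attr, noop

def _sync_after_weight_load(learn):
    # In a distributed run learn.accelerator is an accelerate.Accelerator;
    # otherwise nested_attr returns noop, so this is harmless single-process.
    nested_attr(learn, 'accelerator.wait_for_everyone', noop)()
```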

When would I use it, and when wouldn't I?

When would you: A distributed run
When wouldn't you: When you're not running distributed (which is why it's a nested_attr + noop fallback)

I've tested this on my own 2-GPU system and lr_find ran to completion without issue.
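For context, a hypothetical script illustrating the scenario this PR fixes (the script contents are illustrative, not from the PR):

```python
# Hypothetical train.py: run learn.lr_find() in a distributed setup.
# Launch with e.g. `accelerate launch --num_processes 2 train.py`.
from fastai.vision.all import *
from fastai.distributed import *

def is_cat(f): return f[0].isupper()  # Pets filenames: cat breeds are capitalized

path = untar_data(URLs.PETS)/'images'
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2,
    label_func=is_cat, item_tfms=Resize(224))
learn = vision_learner(dls, resnet34, metrics=error_rate)

with learn.distrib_ctx():  # sets up the distributed (Accelerate-backed) context
    learn.lr_find()        # previously deadlocked when reloading weights (#3688)
```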

cc @jph00

@muellerzr muellerzr requested a review from jph00 as a code owner June 15, 2022 16:21
@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@muellerzr muellerzr changed the title from "Solves hanging load_model" to "Solve hanging load_model and let LRFind be run in a distributed setup" on Jun 15, 2022
@jph00 jph00 added the bug label Jun 19, 2022
@jph00 jph00 merged commit fbf689f into fastai:master Jun 19, 2022
Linked issue closed by this PR: learn.lr_find does not work in distributed setup (#3688)