
[fix] OSS - enforce cuda parameters for state consolidation if NCCL backend #573

Merged

merged 3 commits into master from oss_enforce_cuda_parameters on Apr 4, 2021

Conversation

blefaudeux
Contributor

@blefaudeux blefaudeux commented Apr 2, 2021

Before submitting

  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
  • Did you read the contributor guideline?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Should fix https://fb.workplace.com/groups/pytorchLightning/permalink/1419090048427529/
A use case I had not thought of: the model can be moved to cpu() at some point and the OSS state consolidated afterwards. In that case the consolidation fails, because NCCL only supports CUDA tensors for its communication primitives.
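
For illustration, a minimal repro sketch of that sequence (hypothetical model and training step; assumes a NCCL process group has already been initialized on every rank and a GPU is available):

import torch
import torch.nn as nn
from fairscale.optim import OSS

# Assumes dist.init_process_group(backend="nccl", ...) has already run on each rank.
model = nn.Linear(8, 8).cuda()
optimizer = OSS(model.parameters(), optim=torch.optim.SGD, lr=0.1, momentum=0.9)

# One training step so the sharded optimizer actually holds some state.
model(torch.randn(4, 8, device="cuda")).sum().backward()
optimizer.step()

# The framework may move everything back to CPU (e.g. to free GPU memory)...
model.cpu()

# ...and then ask OSS to gather the sharded state on rank 0. Before this fix
# the broadcasts were issued with CPU tensors, which NCCL rejects.
optimizer.consolidate_state_dict(recipient_rank=0)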

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

cc @ananthsub

@facebook-github-bot added the CLA Signed label on Apr 2, 2021
@blefaudeux blefaudeux requested review from msbaines, min-xu-ai, joshim5 and anj-s and removed request for msbaines April 2, 2021 21:35
@@ -470,6 +470,11 @@ def closure():
_ = optimizer.step(closure=closure)
check_same_models_across_ranks(model, dist.group.WORLD, params_should_be_equal=True, check_broadcast_buffers=False)

# Check that if the model is moved to cpu, the optimizer consolidation still works
model.cpu()
Contributor Author

Without the fix, this unit test fails with the same error that the user reported.

@@ -328,6 +328,9 @@ def consolidate_state_dict(self, recipient_rank: int = 0) -> None:
should_collect_state = self.rank == recipient_rank or recipient_rank == -1
should_send_state = (self.rank != recipient_rank and recipient_rank != -1) or recipient_rank == -1

# NCCL requires CUDA tensors for all communication primitives
dist_device = torch.device("cuda") if self.backend == dist.Backend.NCCL else self._default_device
Contributor Author

No choice with NCCL, it needs to be CUDA.
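
For context, a tiny sketch of that constraint (not taken from the fairscale code; assumes an initialized process group): tensors handed to collectives must live on CUDA when the backend is NCCL, so the device is picked from the backend, mirroring the dist_device logic above.

import torch
import torch.distributed as dist

# Assumes dist.init_process_group(...) has already been called.
device = torch.device("cuda") if dist.get_backend() == dist.Backend.NCCL else torch.device("cpu")
payload = torch.zeros(1, dtype=torch.uint8, device=device)
dist.broadcast(payload, src=0)  # fails with NCCL if `payload` lives on the CPU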

Contributor

If the model is moved back to the CPU and the optimizer state reflects it, why do we call broadcast? The optimizer state is not sharded anymore, right? Maybe I am missing something.

Contributor Author

The framework is the one calling .consolidate(), and it can do so at basically any time. We could add a skip mechanism for when it is called twice in a row (which would be even more foolproof), but that would not solve the train -> move to cpu -> call .consolidate() case, which can be legitimate, if unfortunate.
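
The skip mechanism mentioned above could look roughly like this (purely hypothetical, not part of this PR):

# Hypothetical guard: only consolidate when new local state exists since the
# last call, i.e. make back-to-back .consolidate() calls cheap no-ops.
class ConsolidationGuard:
    def __init__(self) -> None:
        self.dirty = False

    def mark_step(self) -> None:
        # Would be called from optimizer.step(): fresh state to exchange.
        self.dirty = True

    def should_consolidate(self) -> bool:
        # Would be checked at the top of consolidate_state_dict().
        if not self.dirty:
            return False
        self.dirty = False
        return True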

Contributor Author

(Complement) The issue was that if the model is moved to CPU, some tensors in the optimizer dict end up on CPU. When consolidating, the shards are exchanged towards a specific rank (or all ranks), which breaks with NCCL since its communication primitives always expect CUDA tensors.
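
A minimal sketch of the underlying idea, using an illustrative helper that is not the actual fairscale code: recursively move every tensor in the shard to a device the backend can handle before exchanging it.

import torch

def state_to_device(state, device):
    # Recursively copy every tensor in an optimizer state dict to `device`.
    # Illustrative only; fairscale has its own utilities for this.
    if isinstance(state, torch.Tensor):
        return state.to(device)
    if isinstance(state, dict):
        return {k: state_to_device(v, device) for k, v in state.items()}
    if isinstance(state, (list, tuple)):
        return type(state)(state_to_device(v, device) for v in state)
    return state

# Before broadcasting a shard with the NCCL backend:
# shard = state_to_device(optimizer.state_dict(), torch.device("cuda"))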

@blefaudeux
Contributor Author

blefaudeux commented Apr 2, 2021

The broken test was unrelated to this change (pipe).

fairscale/optim/oss.py (outdated review thread, resolved)
@@ -340,18 +343,18 @@ def consolidate_state_dict(self, recipient_rank: int = 0) -> None:
state_to_share = (
self.optim.state_dict()
if should_send_state
else torch.tensor([0], dtype=torch.uint8, device=self._default_device)
else torch.tensor([0], dtype=torch.uint8, device=dist_device)
Contributor

This seems wasteful. Why not skip the broadcast in this case instead of sending a zero? In the else below you could check if rank != recipient.

Co-authored-by: msbaines <35972327+msbaines@users.noreply.github.com>
@blefaudeux blefaudeux merged commit 8855337 into master Apr 4, 2021
@blefaudeux blefaudeux deleted the oss_enforce_cuda_parameters branch April 7, 2021 00:30
Labels: CLA Signed

4 participants