
[FSDP] Add gradient predivide factor to avoid overflow/underflow with large world size #565

Merged 4 commits into master from fsdp_grad_predivide_factor on Apr 3, 2021

Conversation

shruti-bh (Contributor) commented:

This PR scales gradients to avoid overflow/underflow when the world_size is large.
For example, with world_size=1024, reduce_scatter(gradient) / world_size might overflow, while reduce_scatter(gradient / world_size) might underflow. We might prefer reduce_scatter(gradient / 32) / 32 instead.
See NVIDIA’s note about this.
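
For concreteness, here is a minimal sketch of the pre/post-divide idea; this is not the fairscale implementation, and the function names are illustrative only:

import torch.distributed as dist

def get_gradient_predivide_factor(world_size: int) -> float:
    # Grow the factor by powers of two while it still divides world_size and is
    # smaller than the remaining quotient; for world_size=1024 this gives 32.
    factor = 1
    while world_size % factor == 0 and world_size / factor > factor:
        factor *= 2
    return float(factor)

def reduce_scatter_averaged_grad(grad_shards, output, world_size):
    predivide = get_gradient_predivide_factor(world_size)
    postdivide = world_size / predivide
    if predivide > 1:
        for shard in grad_shards:
            shard.div_(predivide)  # divide first so the summed result does not overflow
    dist.reduce_scatter(output, grad_shards)  # sums one shard per rank
    if postdivide > 1:
        output.div_(postdivide)  # finish the averaging: overall scale is 1/world_size
    return output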

Testing:
After 10 updates, I see roughly consistent loss/gnorm between the main branch and this branch.
Main:
2021-03-31 16:55:49 | INFO | train | epoch 001 | loss 16.361 | nll_loss 16.324 | moe_gate_loss 2.56751 | overflow_expert1 24.327 | overflow_expert2 53.304 | entropy_gating 1.994 | expert1_balance_top 68.159 | expert1_balance_bottom 2.701 | unused_expert1_count 0 | expert2_balance_top 54.38 | expert2_balance_bottom 4.655 | unused_expert2_count 0 | ppl 82053.2 | wps 212239 | ups 6.47 | wpb 32768 | bsz 32 | num_updates 10 | lr 0.002 | gnorm 4.115 | loss_scale 32 | train_wall 4 | gb_free 29.2 | wall 24
This Branch:
2021-03-31 16:57:48 | INFO | train | epoch 001 | loss 16.361 | nll_loss 16.324 | moe_gate_loss 2.56767 | overflow_expert1 24.324 | overflow_expert2 53.299 | entropy_gating 1.994 | expert1_balance_top 68.156 | expert1_balance_bottom 2.7 | unused_expert1_count 0 | expert2_balance_top 54.396 | expert2_balance_bottom 4.65 | unused_expert2_count 0 | ppl 82051.3 | wps 216940 | ups 6.61 | wpb 32768 | bsz 32 | num_updates 10 | lr 0.002 | gnorm 4.116 | loss_scale 32 | train_wall 6 | gb_free 29.2 | wall 24

After 100 updates, the loss/gnorm numbers differ slightly after the second decimal place between the main branch and this branch.
Main:
2021-03-31 17:01:50 | INFO | train | epoch 001 | loss 11.751 | nll_loss 11.713 | moe_gate_loss 2.63956 | overflow_expert1 43.949 | overflow_expert2 64.359 | entropy_gating 2.022 | expert1_balance_top 84.331 | expert1_balance_bottom 1.088 | unused_expert1_count 0 | expert2_balance_top 75.253 | expert2_balance_bottom 1.84 | unused_expert2_count 0 | ppl 3356.47 | wps 229459 | ups 7 | wpb 32768 | bsz 32 | num_updates 100 | lr 0 | gnorm 0.931 | loss_scale 32 | train_wall 16 | gb_free 29.2 | wall 38

This Branch:
2021-03-31 16:59:44 | INFO | train | epoch 001 | loss 11.747 | nll_loss 11.706 | moe_gate_loss 2.79818 | overflow_expert1 44.95 | overflow_expert2 65.851 | entropy_gating 2.003 | expert1_balance_top 84.568 | expert1_balance_bottom 0.411 | unused_expert1_count 0 | expert2_balance_top 76.632 | expert2_balance_bottom 0.981 | unused_expert2_count 0 | ppl 3341.38 | wps 227973 | ups 6.96 | wpb 32768 | bsz 32 | num_updates 100 | lr 0 | gnorm 0.937 | loss_scale 32 | train_wall 18 | gb_free 29.2 | wall 37

Would love advice on how to proceed given these differences, which are expected to some degree since we are now scaling the gradients slightly differently.

@myleott (Contributor) left a comment:


A couple comments, but otherwise LG, thanks!

fairscale/nn/data_parallel/fully_sharded_data_parallel.py (outdated review thread; resolved)
@@ -1071,7 +1078,7 @@ def _post_backward_hook(self, param: Parameter, *unused: Any) -> None:

    if self.world_size > 1:

Review comment (Contributor): change to if self.gradient_predivide_factor > 1
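
A hypothetical, self-contained sketch of that suggested guard; the function name below is illustrative and not part of FullyShardedDataParallel:

import torch

def predivide_(grad: torch.Tensor, gradient_predivide_factor: float) -> torch.Tensor:
    # Divide in place only when a predivide factor greater than 1 was chosen,
    # mirroring the suggested "if self.gradient_predivide_factor > 1" check.
    if gradient_predivide_factor > 1:
        grad.div_(gradient_predivide_factor)
    return grad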

@@ -1098,7 +1105,7 @@ def _post_reduction_hook(self, param: Parameter, reduced_grad: torch.Tensor) ->
     assert torch.cuda.current_stream() == self._streams["post_backward"]
     assert param.grad is not None
     self.assert_state(TrainingState.BACKWARD_POST)
-    param.grad.data = reduced_grad
+    param.grad.data = reduced_grad.div_(self.world_size / self.gradient_predivide_factor)
Review comment (Contributor): change to:

param.grad.data = reduced_grad
if self.gradient_postdivide_factor > 1:
    param.grad.data.div_(self.gradient_postdivide_factor)
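
(Presumably gradient_predivide_factor * gradient_postdivide_factor == world_size, so the divide before the reduce-scatter and the divide here combine to the usual 1/world_size averaging; in the world_size=1024 example from the description, 32 * 32 = 1024.)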

@myleott merged commit 04001e7 into master on Apr 3, 2021
@myleott deleted the fsdp_grad_predivide_factor branch on April 3, 2021 at 21:29
@myleott (Contributor) commented on Apr 3, 2021:

@shruti-bh I pushed the lint fixes and merged this 😄

Labels: CLA Signed (managed by the Facebook bot; authors must sign the CLA before a PR can be reviewed)
4 participants