Fix fsdp+pp+te WPS decreasing issue #1139

Merged

Conversation

@jianyuh (Member) commented Oct 1, 2023

What does this PR do?

Fixes WPS decreasing over training steps, caused by FSDP post-backward hook handle issues.

All credit goes to @awgu and @vivien-chu. Merging this into ngoyal_changes_for_pp_fp8 due to the failure observed in test_te.py when reproducing https://github.com/fairinternal/xlformers/pull/1360 with @jiecaoyu and @jspark1105:

-- Process 5 terminated with the following error:
Traceback (most recent call last):
  File "/home/jianyuhuang/Work/Github/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/home/jianyuhuang/Work/Github/xlformers/tests/test_te.py", line 470, in run_demo
    loss.backward()
  File "/home/jianyuhuang/Work/Github/pytorch/torch/_tensor.py", line 491, in backward
    torch.autograd.backward(
  File "/home/jianyuhuang/Work/Github/pytorch/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
SystemError: <built-in method run_backward of torch._C._EngineBase object at 0x7fba963bbc80> returned NULL without setting an exception

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue? (not needed for typos or doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If your PR was not discussed in a GitHub issue, there is a high chance it will not be merged.

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Oct 1, 2023
@jianyuh marked this pull request as ready for review on October 1, 2023 at 23:23
@@ -1650,6 +1654,9 @@ def _register_post_backward_hooks(self) -> None:
                assert p_tmp.grad_fn is not None
                grad_acc = p_tmp.grad_fn.next_functions[0][0]  # Gets its GradAccumulation object.
                handle = grad_acc.register_hook(functools.partial(self._post_backward_hook, p))
                if not hasattr(p, "_shard_bwd_hooks"):
                    p._shard_bwd_hooks = []
                p._shard_bwd_hooks.append((grad_acc, handle))
                p._shard_bwd_hook = (grad_acc, handle)

Reviewer comment:

This should be deleted? See P841842878 CC @awgu

@jianyuh (Member, Author) replied:

Commented out this line, following the style in this file.
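
For context, a minimal standalone sketch of the pattern this hunk enables (an illustration with hypothetical names such as register_post_backward_hooks / remove_post_backward_hooks, not the fairscale implementation): every (grad_acc, handle) pair is kept in a list so that all hooks registered on the AccumulateGrad nodes can be removed after backward; if handles are lost, hooks accumulate across iterations and each backward gets slower, which is consistent with the WPS decrease this PR fixes.

    # Illustrative sketch only; function names are hypothetical, not fairscale APIs.
    import functools

    import torch

    def post_backward_hook(param, *unused):
        # Placeholder for the real reduce-scatter / grad-processing logic.
        pass

    def register_post_backward_hooks(params):
        for p in params:
            if not p.requires_grad:
                continue
            # expand_as gives a non-leaf view whose grad_fn leads to p's AccumulateGrad node.
            p_tmp = p.expand_as(p)
            assert p_tmp.grad_fn is not None
            grad_acc = p_tmp.grad_fn.next_functions[0][0]
            handle = grad_acc.register_hook(functools.partial(post_backward_hook, p))
            if not hasattr(p, "_shard_bwd_hooks"):
                p._shard_bwd_hooks = []
            # Keep grad_acc alive next to the handle so the node is not garbage-collected.
            p._shard_bwd_hooks.append((grad_acc, handle))

    def remove_post_backward_hooks(params):
        for p in params:
            for _, handle in getattr(p, "_shard_bwd_hooks", []):
                handle.remove()
            if hasattr(p, "_shard_bwd_hooks"):
                p._shard_bwd_hooks.clear()

    # Usage: register before backward, remove once gradients have been processed.
    w = torch.nn.Parameter(torch.randn(4, 4))
    register_post_backward_hooks([w])
    w.sum().backward()
    remove_post_backward_hooks([w])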

Comment on lines 1224 to 1226
        for module_name, module in self.named_modules():
            if isinstance(module, FullyShardedDataParallel):
                module._module_fqn = module_name

Reviewer comment:

These are only needed for debugging (when _FSDP_DEBUG is set in P841842878).

@jianyuh (Member, Author) replied:

Deleted.
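
As an aside, a small hedged sketch of how the FQN tagging could be kept for debugging only (the _FSDP_DEBUG gating below is an assumption based on the comment above; tag_module_fqns is a hypothetical helper, not code from this PR or P841842878):

    # Hypothetical helper: tag wrapped submodules with their fully qualified names,
    # but only when the _FSDP_DEBUG environment variable is set.
    import os

    import torch.nn as nn

    def tag_module_fqns(root: nn.Module, wrapper_cls: type) -> None:
        if not os.environ.get("_FSDP_DEBUG"):
            return
        for module_name, module in root.named_modules():
            if isinstance(module, wrapper_cls):
                # Stored only for debug log messages, e.g. in the post-backward hook.
                module._module_fqn = module_name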

Comment on lines 1736 to 1740
            if self.fp32_reduce_scatter:
                # Cast grad to FP32.
                param.grad.data = param.grad.data.float()
                orig_grad_data = param.grad.data.float()
            else:
                orig_grad_data = param.grad.data

Reviewer comment:

We should keep param.grad.data = param.grad.data.float(), so something like:

            if self.fp32_reduce_scatter:
                # Cast grad to FP32.
                param.grad.data = param.grad.data.float()

            orig_grad_data = param.grad.data

@jianyuh (Member, Author) replied:

Updated.
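
To spell out the accepted suggestion, a small standalone sketch (illustrative only; cast_grad_for_reduce_scatter is a made-up name, not fairscale code): the gradient is cast to FP32 in place exactly once, and orig_grad_data then simply aliases param.grad.data, avoiding the redundant second .float() copy per parameter.

    # Illustrative sketch of the suggested fp32_reduce_scatter cast.
    import torch

    def cast_grad_for_reduce_scatter(param: torch.nn.Parameter, fp32_reduce_scatter: bool) -> torch.Tensor:
        if fp32_reduce_scatter:
            # Cast grad to FP32 once, replacing the low-precision grad data.
            param.grad.data = param.grad.data.float()
        # No extra .float() here: orig_grad_data aliases the (possibly cast) grad,
        # so no second FP32 copy is allocated.
        orig_grad_data = param.grad.data
        return orig_grad_data

    # Usage with a bf16 parameter and grad:
    p = torch.nn.Parameter(torch.randn(8, dtype=torch.bfloat16))
    p.grad = torch.randn(8, dtype=torch.bfloat16)
    g = cast_grad_for_reduce_scatter(p, fp32_reduce_scatter=True)
    assert g.dtype == torch.float32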

@@ -1710,6 +1713,13 @@ def _post_backward_hook(self, param: Parameter, *unused: Any) -> None:

            # Switch to FP32 shard after backward.
            self._use_fp32_param_shard([param])
            if self.mixed_precision and self.fp32_reduce_scatter:

Reviewer comment:

Sorry, I missed this comment before. I wonder if we can have a separate commit for the main_grad-related changes, apart from the changes for the WPS decrease fix (P841842878).

@jianyuh (Member, Author) replied:

Split into 2 PRs (#1139 and #1140)
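
For readers unfamiliar with the main_grad changes mentioned here, a hedged sketch of the general pattern (an assumption about what "main_grad" refers to, in the Megatron/TransformerEngine style; this is not the code in #1140): each parameter keeps an FP32 main_grad buffer, low-precision grads are accumulated into it after backward, and the FP32 buffer is what gets reduce-scattered.

    # Illustrative sketch only; accumulate_into_main_grad is a made-up helper.
    import torch

    def accumulate_into_main_grad(param: torch.nn.Parameter) -> None:
        # Lazily allocate an FP32 accumulation buffer on the parameter.
        if getattr(param, "main_grad", None) is None:
            param.main_grad = torch.zeros_like(param, dtype=torch.float32)
        if param.grad is not None:
            # Accumulate the low-precision grad in FP32, then drop the original grad;
            # the subsequent reduce-scatter would operate on main_grad instead.
            param.main_grad.add_(param.grad.float())
            param.grad = None

    p = torch.nn.Parameter(torch.randn(4, dtype=torch.bfloat16))
    p.grad = torch.ones(4, dtype=torch.bfloat16)
    accumulate_into_main_grad(p)
    assert p.main_grad.dtype == torch.float32 and p.grad is None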

@jianyuh mentioned this pull request on Oct 2, 2023
@jianyuh merged commit 0db6e62 into ngoyal_changes_for_pp_fp8 on Oct 2, 2023
1 of 18 checks passed