
[FSDP] use all_gather for 10X OSD consolidation speedup #595

Merged: 23 commits into master from all-gather-impl on Apr 13, 2021

Conversation

@sshleifer (Contributor) commented on Apr 8, 2021:

TLDR: Using all_gather instead of broadcast for optimizer state consolidation appears to be a speed win without a memory cost.

Approach:

  • For tensor state in the OSD (optimizer state dict) we use all_gather (see the sketch after this list).
  • For non-tensor metadata (loss_scale, param_groups, num_padded) we use broadcast.
  • The same OSD (for a 300M-param model) takes 110 ms to consolidate instead of 2300 ms; the speedup is usually around 10X.
  • We assume recipient_rank=0, since there are no other callers.
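
To make the approach concrete, here is a minimal sketch (hypothetical helper names, not the PR's actual code), assuming torch.distributed is initialized and that FSDP has already padded each rank's shard to a uniform size:

```python
# Hedged sketch of the approach: all_gather for tensor shards, an object
# broadcast for small non-tensor metadata. Helper names are hypothetical.
from typing import Any, Dict, List

import torch
import torch.distributed as dist


def gather_tensor_shards(shard: torch.Tensor) -> List[torch.Tensor]:
    """Every rank contributes its shard and gets the full list of shards back
    in one collective, instead of world_size point-to-point broadcasts."""
    world_size = dist.get_world_size()
    buffers = [torch.zeros_like(shard) for _ in range(world_size)]
    dist.all_gather(buffers, shard)  # assumes equal-sized shards (FSDP pads)
    return buffers


def broadcast_metadata(meta: Dict[str, Any], src_rank: int) -> Dict[str, Any]:
    """Non-tensor metadata (loss_scale, param_groups, num_padded) is tiny, so an
    object broadcast from its owning rank is cheap."""
    obj = [meta if dist.get_rank() == src_rank else None]
    dist.broadcast_object_list(obj, src=src_rank)
    return obj[0]
```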

Evidence of Win:

  • This appears not to use extra GPU RAM and saves a lot of time: in the unit tests, consolidating a larger MoE goes from 2300 ms to 110 ms.
  • For a large fairseq 2.2B-param model (1.2T config), consolidation takes 80 s instead of 800 s, with less CUDA usage.
  • The fairseq resumption test (resuming training from a checkpoint should result in the same loss as training from scratch) passes.

@facebook-github-bot added the CLA Signed label on Apr 8, 2021
@sshleifer changed the title from [FSDP/Prototype] to [FSDP/Prototype] use all gather for OSD consolidation on Apr 8, 2021

return non_tensor_state, tensor_state

def _gather_optim_state(self, sd_state: Dict[int, Dict[str, Any]]) -> Dict[int, Dict[str, List]]:
@sshleifer (PR author): This is the new _all_gather logic.
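
For orientation, a hedged sketch of the kind of split that produces the `return non_tensor_state, tensor_state` above (hypothetical code, not the PR's implementation): tensor values are routed to the state that gets all_gathered, everything else to the state that gets broadcast.

```python
# Hypothetical sketch: split one rank's optimizer state (keyed by param id) into
# tensor state (to all_gather) and non-tensor state (to broadcast).
from typing import Any, Dict, Tuple

import torch


def split_optim_state(
    osd_state: Dict[int, Dict[str, Any]]
) -> Tuple[Dict[int, Dict[str, Any]], Dict[int, Dict[str, torch.Tensor]]]:
    non_tensor_state: Dict[int, Dict[str, Any]] = {}
    tensor_state: Dict[int, Dict[str, torch.Tensor]] = {}
    for param_id, param_state in osd_state.items():
        non_tensor_state[param_id] = {}
        tensor_state[param_id] = {}
        for key, value in param_state.items():
            if torch.is_tensor(value):
                tensor_state[param_id][key] = value  # e.g. exp_avg, exp_avg_sq
            else:
                non_tensor_state[param_id][key] = value  # e.g. step counts
    return non_tensor_state, tensor_state
```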

Base automatically changed from move-placeholder-cpu to master April 8, 2021 23:07
@sshleifer closed this on Apr 9, 2021
@sshleifer reopened this on Apr 9, 2021
@sshleifer marked this pull request as ready for review on April 9, 2021 01:21
@sshleifer changed the title from [FSDP/Prototype] use all gather for OSD consolidation to [FSDP] use all_gather for 10X OSD consolidation speedup on Apr 9, 2021
@@ -627,15 +627,18 @@ def __init__(self, group, wrapper_config, checkpoint_act=False, delay_before_fre

# "expert" params are different on each rank
torch.manual_seed(42 + group.rank())
d_expert = 16
expert = nn.Linear(d_expert, 4)
d_expert = 23
@sshleifer (PR author): make sure we unpad expert params correctly.
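
For background on why an odd expert size exercises the unpad path, here is a hedged, self-contained illustration (made-up numbers, not the test's code): FSDP pads each flat shard so it divides evenly across ranks, and the recipient has to strip that padding when rebuilding the full tensor.

```python
# Hedged illustration of padding/unpadding during consolidation; not fairscale code.
import torch
import torch.nn.functional as F

world_size = 4
flat_param = torch.arange(10, dtype=torch.float32)   # 10 elements: not divisible by 4

num_padded = (-flat_param.numel()) % world_size       # 2 padding elements needed
padded = F.pad(flat_param, (0, num_padded))           # length 12 -> 4 shards of 3
shards = list(padded.chunk(world_size))               # what each rank would hold

# After all_gather, the recipient concatenates the shards and strips the padding;
# the per-rank num_padded counts travel with the broadcast metadata.
full = torch.cat(shards)[: flat_param.numel()]
assert torch.equal(full, flat_param)
```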

@sshleifer (PR author): Planning to merge 10am PT tomorrow, barring further comments.

@min-xu-ai (Contributor) left a review: I don't have time to fully review this, but this looks good at a high level.

"""Return the last known global optimizer state. The returned state is compatible with Pytorch, in that the
sharded properties are not exposed. Multiple parameter groups are not yet supported.

This should be called only on the root FSDP instance.
Nested FSDP instances are supported as long as they have the same world_size as the parent or world_size=1.
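
Assuming the docstring above belongs to fairscale's gather_full_optim_state_dict, usage would look roughly like this (a hedged sketch, not taken from the PR; assumes torch.distributed is initialized and that only the recipient rank 0 receives the consolidated dict):

```python
# Hedged usage sketch of OSD consolidation on the root FSDP instance.
import torch
import torch.distributed as dist
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

model = FSDP(torch.nn.Linear(16, 4))                    # root FSDP instance
optim = torch.optim.Adam(model.parameters(), lr=1e-4)

# ... train for a while ...

full_osd = model.gather_full_optim_state_dict(optim)    # collective: call on every rank
if dist.get_rank() == 0:
    torch.save(full_osd, "optim_state.pt")               # sharded properties not exposed
```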
Reviewer comment on the docstring above: nice

@min-xu-ai (Contributor): cc @QuentinDuval @prigoyal FYI

@min-xu-ai (Contributor):

> Planning to merge 10am PT tomorrow, barring further comments.

Sorry for the delay, I was out for most of the last week.

new_sd = {"state": new_state, "param_groups": copy.deepcopy(sd["param_groups"])}
for k in sd.keys():  # if there are extra keys, like loss_scale, don't delete them
    if k not in {"state", "param_groups", "uncollected_local_ids", "param_id_map"}:
Reviewer (Contributor): I'm slightly uneasy about this falling out of sync with line 160. Thoughts on some way to enforce parity between them?

@sshleifer (PR author): module-level constant + comment + assert:

# These return keys are used by fairseq. To change, add @sshleifer as a reviewer.
UNFLAT_RETURN_KEYS = {"state", "param_groups", "uncollected_local_ids", "param_id_map"}
...
assert set(unflat_optim_state_dict.keys()) == UNFLAT_RETURN_KEYS 

@myleott (Contributor) left a review: LGTM!

@sshleifer added the FSDP (FullyShardedDataParallel, zero-3) label on Apr 13, 2021
@sshleifer merged commit a82825d into master on Apr 13, 2021
@sshleifer deleted the all-gather-impl branch on April 13, 2021 at 15:21