reshard_model_parallel.py simply calls the reshard_megatron_parts function from stitch_fsdp_ckpt.py. It appears to assume that the weights are unflattened and that the corresponding model keys are present in the checkpoint, neither of which holds for the released checkpoints.
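To make the flattened-weights problem concrete, here is a minimal sketch of what "unflattening" involves. The function and metadata format below are hypothetical (real FSDP checkpoints store their flattening metadata in their own structures): it assumes a single flat 1-D tensor plus a recorded list of (name, shape) pairs.

```python
import torch

def unflatten_params(flat, names_and_shapes):
    """Split one flat 1-D tensor back into named parameter tensors.

    `names_and_shapes` is a hypothetical list of (key, shape) pairs
    recorded when the parameters were flattened; this is only a sketch
    of the idea, not the actual FSDP metadata layout.
    """
    out, offset = {}, 0
    for name, shape in names_and_shapes:
        numel = 1
        for d in shape:
            numel *= d
        # Each parameter occupies a contiguous slice of the flat tensor.
        out[name] = flat[offset:offset + numel].view(shape)
        offset += numel
    assert offset == flat.numel(), "metadata does not match flat tensor size"
    return out
```

A consolidated script would need either this metadata in the checkpoint or a model instance to read shapes from, which is part of why the current scripts diverge.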
convert_to_singleton.py does work with the released checkpoints, but it carries additional requirements unrelated to the resharding logic (bpe_vocab, bpe_merges, etc.), since it launches DDP and instantiates an FSDP object via the LegacyTask. That approach may not be very flexible (see this issue).
What would the checkpoint inputs for the consolidated script look like? Do we need to support all use cases mentioned above?
🚀 Feature Request
We run a single model shard on each GPU, with a combination of data and model parallelism. We have a few different ways of doing resharding today (i.e., converting a model from X shards to Y shards); these should be consolidated into a single solution that we trust.
Supported model-parallel sizes for input and output can be restricted to 1/2/4/8 (the size may differ between input and output), and the overall number of shards can be restricted to powers of 2 to start.
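The core X-shards-to-Y-shards operation can be sketched as merge-then-split. This is an illustrative sketch only, assuming a weight that was partitioned evenly along one dimension (as for a Megatron column-parallel layer); the power-of-2 restriction guarantees the merged size divides evenly into the output shard count.

```python
import torch

def reshard(parts, num_out, dim=0):
    """Reshard a tensor from len(parts) shards to num_out shards.

    Sketch under the assumption that the tensor was split evenly
    along `dim`; restricting shard counts to powers of 2 ensures
    the merged dimension divides evenly by num_out.
    """
    assert num_out > 0 and num_out & (num_out - 1) == 0, \
        "shard counts restricted to powers of 2 to start"
    full = torch.cat(parts, dim=dim)                   # X shards -> full tensor
    return list(torch.chunk(full, num_out, dim=dim))   # full tensor -> Y shards
```

Row-parallel weights would use a different split dimension, and fused QKV projections need interleaving-aware handling, which is exactly the logic a single trusted script should own.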
Motivation
We have three ways of resharding today:
Let's consolidate and clean up the code. This will be useful for having a single code path for loading models later: #78