Add a new script to reshard model parallel parts #556

Merged
tangbinh merged 1 commit into main from reshard-mp on Dec 28, 2022

@tangbinh tangbinh commented Dec 21, 2022

Summary of Changes

The existing script for resharding model parallel parts (i.e. metaseq/scripts/reshard_model_parallel.py) loads all checkpoint parts at once and might result in OOM issues under RAM constraints, especially for very large models. Here, we rewrite the script and optimize for memory usage by first allocating an unsharded model state dict and iteratively merging model parallel parts into it.

Previously, peak memory usage was close to 2x model size because we needed to hold both the input and output state dicts in memory. With the iterative process, it is theoretically closer to 1x model size.
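
The iterative merge described above can be sketched roughly as follows. This is a minimal illustration, not the code in reshard_mp.py: nested Python lists stand in for tensors, and the names `merge_parts_iteratively`, `load_part`, and `shard_dim` (a map from weight name to the dimension it is split along) are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Matrix = List[List[float]]
StateDict = Dict[str, Matrix]


def merge_parts_iteratively(
    load_part: Callable[[int], StateDict],  # hypothetical loader, one part per call
    num_parts: int,
    shard_dim: Dict[str, Optional[int]],  # assumed: 0, 1, or missing for replicated weights
) -> StateDict:
    """Merge model parallel parts one at a time into a preallocated state dict."""
    merged: StateDict = {}
    for rank in range(num_parts):
        part = load_part(rank)  # only one input part is held at a time
        for name, shard in part.items():
            rows, cols = len(shard), len(shard[0])
            dim = shard_dim.get(name)
            if name not in merged:
                # Allocate the full-size buffer the first time a weight is seen.
                full_rows = rows * num_parts if dim == 0 else rows
                full_cols = cols * num_parts if dim == 1 else cols
                merged[name] = [[0.0] * full_cols for _ in range(full_rows)]
            if dim is None and rank > 0:
                continue  # replicated weight: the copy from rank 0 is enough
            row0 = rank * rows if dim == 0 else 0
            col0 = rank * cols if dim == 1 else 0
            for r in range(rows):
                merged[name][row0 + r][col0:col0 + cols] = shard[r]
        del part  # drop the input part before loading the next one
    return merged
```

Because each part is released before the next is loaded, the buffers that stay alive are the merged state dict plus a single input part, which is where the "closer to 1x" estimate comes from.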

The new script produces the same output as metaseq/scripts/reshard_model_parallel.py. We delete the old script to avoid duplication; it remains accessible in the internal repo (see this script).
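
To produce the requested number of output parts, each merged weight has to be cut again along its sharded dimension. The sketch below shows one way this could look, again with nested lists in place of tensors. `split_for_output` and `output_path` are hypothetical helpers, and filling the `{i}` placeholder of `--output` with the part index is an assumption based on the commands in the test plan.

```python
from typing import List, Optional

Matrix = List[List[float]]


def split_for_output(weight: Matrix, num_output_parts: int, dim: Optional[int]) -> List[Matrix]:
    """Split a merged weight into equal chunks along `dim` (None means replicate)."""
    if dim is None:
        # Replicated weights are copied into every output part.
        return [[row[:] for row in weight] for _ in range(num_output_parts)]
    size = len(weight) if dim == 0 else len(weight[0])
    if size % num_output_parts != 0:
        raise ValueError(f"size {size} is not divisible by {num_output_parts}")
    chunk = size // num_output_parts
    parts = []
    for i in range(num_output_parts):
        lo, hi = i * chunk, (i + 1) * chunk
        if dim == 0:
            parts.append([row[:] for row in weight[lo:hi]])  # slice rows
        else:
            parts.append([row[lo:hi] for row in weight])  # slice columns
    return parts


def output_path(pattern: str, i: int) -> str:
    """Fill the `{i}` placeholder with the output part index (assumed convention)."""
    return pattern.replace("{i}", str(i))
```

For example, `output_path("reshard-model_part-{i}.pt", 3)` gives `reshard-model_part-3.pt`.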

Test Plan

  • Run the script with an OPT 2.7B checkpoint to reshard 4 MP parts into 8 MP parts and make sure the resulting checkpoint performs reasonably:
    seq 0 3 | parallel --line-buffer 'python metaseq/scripts/reshard_fsdp.py --input "/data/checkpoints/opt-2.7b/raw/checkpoint_last-model_part-{}-shard*.pt" --output "/data/checkpoints/opt-2.7b/reshard_no_os/reshard-model_part-{}.pt" --skip-optimizer-state True --unflatten-weights True --output-dtype fp16'
    python -m metaseq.scripts.reshard_mp --input "/data/checkpoints/opt-2.7b/reshard_no_os/reshard-model_part-*.pt" --output "/data/checkpoints/opt-2.7b/reshard_no_os_mp8/reshard-model_part-{i}.pt" --num-output-parts 8
    
    python metaseq/scripts/interactive.py --merges-filename /data/checkpoints/gpt2-merges.txt --vocab-filename /data/checkpoints/gpt2-vocab.json --path /data/checkpoints/opt-2.7b/reshard_no_os_mp8/reshard.pt --model-parallel-size 8 --distributed-world-size 8  --beam 3 --max-source-positions 4 --max-target-positions 128
    
    > Prompt: What is the meaning of life?
    Output: To be happy.
    
  • We compare performance with metaseq/scripts/reshard_model_parallel.py while resharding an OPT-175B checkpoint from 8 MP parts into 16 MP parts. The old script takes 849.40 seconds and results in a peak RSS delta of 668,301 MB, while the new script takes 891.65 seconds and has an RSS delta of 458,185 MB. That is about 5% slower with a 31% reduction in RAM usage (put differently, the old script needs 46% more RAM than the new one).
    python metaseq/scripts/reshard_model_parallel.py --pth_prefix /data/checkpoints/opt-175b/reshard_no_os_unflat/reshard.pt --new-model-parts 16 --save-prefix  /data/checkpoints/opt-175b/reshard_no_os_unflat_mp16_ref/reshard.pt
    python -m metaseq.scripts.reshard_mp --input "/data/checkpoints/opt-175b/reshard_no_os_unflat/reshard-model_part-*.pt" --output "/data/checkpoints/opt-175b/reshard_no_os_unflat_mp16/reshard-model_part-{i}.pt" --num-output-parts 16
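
A quick check of the figures quoted above, using only the numbers reported in the test plan. The relative reduction in RSS and the ratio of old to new RSS are different quantities, so both are computed; the variable and function names are illustrative.

```python
# Figures reported in the OPT-175B comparison above.
old_rss_mb = 668_301
new_rss_mb = 458_185
old_seconds = 849.40
new_seconds = 891.65


def relative_change(before: float, after: float) -> float:
    """Signed change of `after` relative to `before`."""
    return (after - before) / before


# How much less RAM the new script uses, relative to the old one.
rss_reduction = -relative_change(old_rss_mb, new_rss_mb)
# How much more RAM the old script uses, relative to the new one.
old_over_new = old_rss_mb / new_rss_mb - 1
# How much longer the new script takes, relative to the old one.
slowdown = relative_change(old_seconds, new_seconds)

print(f"RSS reduction (new vs. old): {rss_reduction:.1%}")
print(f"Extra RSS (old vs. new):     {old_over_new:.1%}")
print(f"Extra runtime (new vs. old): {slowdown:.1%}")
```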
    

@ruanslv ruanslv commented Dec 21, 2022

Is it possible to run $ git mv so that the new script becomes a second version of the original one? Makes it easier to review.

Also lint failed, maybe run black?

@tangbinh tangbinh commented

> Is it possible to run $ git mv so that the new script becomes a second version of the original one? Makes it easier to review.
>
> Also lint failed, maybe run black?

Thank you for the review. I did use git mv, but it didn't trigger the comparison of file contents (perhaps the two files are too different). Please see the updated PR for the linter fix.

@ruanslv ruanslv left a comment

Great job getting this to work! The logic here is tricky.

I don't think we had fully validated the previous script either, so a good exercise for us once all the scripts are merged is to make sure the resharded model behind the API matches the validation perplexity (ppl) seen during training.
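
As a rough sketch of what such a check could look like: perplexity is the exponential of the negative mean token log-probability, so the two numbers can be compared within a tolerance. The function names and the 1% tolerance are assumptions for illustration, not part of metaseq.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ppl_matches(resharded_ppl: float, training_ppl: float, rel_tol: float = 0.01) -> bool:
    """True if the two perplexities agree within `rel_tol` (assumed threshold)."""
    return math.isclose(resharded_ppl, training_ppl, rel_tol=rel_tol)
```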

@tangbinh tangbinh merged commit 966561e into main Dec 28, 2022
@tangbinh tangbinh deleted the reshard-mp branch December 28, 2022 21:11
sahajgg added a commit to sahajgg/metaseq that referenced this pull request Jan 2, 2023
commit 511504b
Author: Susan Zhang <suchenzang@users.noreply.github.com>
Date:   Sun Jan 1 17:00:25 2023 +0100

    Init for model_parallel == 1 (facebookresearch#577)

    * gate by arch, not by mp size

    * add back mp > 1 conditional

commit 59403be
Author: Susan Zhang <suchenzang@users.noreply.github.com>
Date:   Sun Jan 1 00:42:37 2023 +0100

    [Cleanup] Remove MegatronTrainer (facebookresearch#576)

commit 6687b6f
Author: Susan Zhang <suchenzang@users.noreply.github.com>
Date:   Sat Dec 31 17:38:28 2022 +0100

    use bash (facebookresearch#575)

commit a87e08f
Author: Stephen Roller <roller@fb.com>
Date:   Fri Dec 30 14:11:57 2022 -0500

    Add Sharan to CODEOWNERS (facebookresearch#558)

commit 1d4af00
Author: Stephen Roller <roller@fb.com>
Date:   Fri Dec 30 14:11:47 2022 -0500

    Fix config.yml dump in training runs. (facebookresearch#557)

commit ed85aad
Author: Christian Clauss <cclauss@me.com>
Date:   Fri Dec 30 07:43:41 2022 +0100

    Current flake8 no longer accepts comments on config lines (facebookresearch#570)

    * Current flake8 no longer accepts comments on config lines

    `ValueError: Error code '#' supplied to 'extend-ignore' option does not match '^[A-Z]{1,3}[0-9]{0,3}$'`

    * flake8==6.0.0

    * Update .flake8

    * Update setup.py

    Co-authored-by: Stephen Roller <roller@fb.com>

    Co-authored-by: Stephen Roller <roller@fb.com>

commit db6842b
Author: Taichi Nishimura <lack_un@yahoo.co.jp>
Date:   Fri Dec 30 12:14:49 2022 +0900

    Add backslash to the script in projects/OPT/download_opt175b.md (facebookresearch#573)

    * add backslash to script

    * add backslash to docs/api.md

commit 966561e
Author: Binh Tang <tangbinh.na@gmail.com>
Date:   Wed Dec 28 13:11:39 2022 -0800

    Add a new script to reshard model parallel parts (facebookresearch#556)

    Co-authored-by: Binh Tang <tangbinhna@gmail.com>
sahajgg added a commit to sahajgg/metaseq that referenced this pull request Jan 2, 2023
commit 92033ae
Merge: fea5f00 511504b
Author: sahagar <68225900+sahagar@users.noreply.github.com>
Date:   Mon Jan 2 16:11:59 2023 +0530

    Merge branch 'main' of https://github.com/sahagar/metaseq

commit fea5f00
Author: sahagar <68225900+sahagar@users.noreply.github.com>
Date:   Mon Jan 2 16:04:04 2023 +0530

    add back reshard script

commit 802aa42
Author: sahagar <68225900+sahagar@users.noreply.github.com>
Date:   Mon Jan 2 15:57:35 2023 +0530

    Squashed commit of the following:

    commit 511504b
    Author: Susan Zhang <suchenzang@users.noreply.github.com>
    Date:   Sun Jan 1 17:00:25 2023 +0100

        Init for model_parallel == 1 (facebookresearch#577)

        * gate by arch, not by mp size

        * add back mp > 1 conditional

    commit 59403be
    Author: Susan Zhang <suchenzang@users.noreply.github.com>
    Date:   Sun Jan 1 00:42:37 2023 +0100

        [Cleanup] Remove MegatronTrainer (facebookresearch#576)

    commit 6687b6f
    Author: Susan Zhang <suchenzang@users.noreply.github.com>
    Date:   Sat Dec 31 17:38:28 2022 +0100

        use bash (facebookresearch#575)

    commit a87e08f
    Author: Stephen Roller <roller@fb.com>
    Date:   Fri Dec 30 14:11:57 2022 -0500

        Add Sharan to CODEOWNERS (facebookresearch#558)

    commit 1d4af00
    Author: Stephen Roller <roller@fb.com>
    Date:   Fri Dec 30 14:11:47 2022 -0500

        Fix config.yml dump in training runs. (facebookresearch#557)

    commit ed85aad
    Author: Christian Clauss <cclauss@me.com>
    Date:   Fri Dec 30 07:43:41 2022 +0100

        Current flake8 no longer accepts comments on config lines (facebookresearch#570)

        * Current flake8 no longer accepts comments on config lines

        `ValueError: Error code '#' supplied to 'extend-ignore' option does not match '^[A-Z]{1,3}[0-9]{0,3}$'`

        * flake8==6.0.0

        * Update .flake8

        * Update setup.py

        Co-authored-by: Stephen Roller <roller@fb.com>

        Co-authored-by: Stephen Roller <roller@fb.com>

    commit db6842b
    Author: Taichi Nishimura <lack_un@yahoo.co.jp>
    Date:   Fri Dec 30 12:14:49 2022 +0900

        Add backslash to the script in projects/OPT/download_opt175b.md (facebookresearch#573)

        * add backslash to script

        * add backslash to docs/api.md

    commit 966561e
    Author: Binh Tang <tangbinh.na@gmail.com>
    Date:   Wed Dec 28 13:11:39 2022 -0800

        Add a new script to reshard model parallel parts (facebookresearch#556)

        Co-authored-by: Binh Tang <tangbinhna@gmail.com>

commit 4eb133c
Merge: a9b23dd b929eef
Author: sahagar <68225900+sahagar@users.noreply.github.com>
Date:   Tue Dec 27 18:45:48 2022 +0000

    Merge branch 'facebookresearch:main' into main
sahajgg added a commit to sahajgg/metaseq that referenced this pull request Jan 2, 2023
* simplify cli

* add error pipe

* distributed training updates

* bug fix

* bug fixes

* update

* updates

* updates

* bug fix

* updates

* bug fix

* bug fix

* bug fix

* updates

* bug fix

* try updates

* comment out excessive info printed in terminal

* updates

* check point updates

* remove excess logs

* updates

* bug fix

* add cpu debug job

* update

* update

* update scripts

* Squashed commit of the following: 92033ae, fea5f00, 802aa42, 511504b, 59403be, 6687b6f, a87e08f, 1d4af00, ed85aad, db6842b, 966561e, 4eb133c