Add a new script to reshard model parallel parts #556
Is it possible to run `git mv` so that the new script becomes a second version of the original one? That makes it easier to review. Also, lint failed; maybe run `black`?
Thank you for the review. I did use |
Great job in getting this to work! The logic here is tricky.
I don't think we had fully validated the previous script either. Once all the scripts are merged, a good exercise for us would be to check that the resharded model behind the API matches the validation perplexity (ppl) seen during training.
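A rough sketch of what such a check could look like, assuming per-token log-probabilities can be collected from the resharded model on the validation set. The function names and the 1% tolerance are hypothetical and not part of metaseq:

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity is exp of the mean negative log-likelihood (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


def matches_training_ppl(
    resharded_ppl: float, training_ppl: float, rel_tol: float = 0.01
) -> bool:
    """Compare the two values within a relative tolerance (assumed: 1%)."""
    return math.isclose(resharded_ppl, training_ppl, rel_tol=rel_tol)


# Toy data: every token predicted with probability 0.25.
toy_log_probs = [math.log(0.25)] * 4
toy_ppl = perplexity(toy_log_probs)
print(toy_ppl, matches_training_ppl(toy_ppl, 4.02))
```

Small numerical differences between training and inference are expected, so a tolerance is more useful here than an exact comparison.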
commit 511504b
Author: Susan Zhang <suchenzang@users.noreply.github.com>
Date: Sun Jan 1 17:00:25 2023 +0100

    Init for model_parallel == 1 (facebookresearch#577)
    * gate by arch, not by mp size
    * add back mp > 1 conditional

commit 59403be
Author: Susan Zhang <suchenzang@users.noreply.github.com>
Date: Sun Jan 1 00:42:37 2023 +0100

    [Cleanup] Remove MegatronTrainer (facebookresearch#576)

commit 6687b6f
Author: Susan Zhang <suchenzang@users.noreply.github.com>
Date: Sat Dec 31 17:38:28 2022 +0100

    use bash (facebookresearch#575)

commit a87e08f
Author: Stephen Roller <roller@fb.com>
Date: Fri Dec 30 14:11:57 2022 -0500

    Add Sharan to CODEOWNERS (facebookresearch#558)

commit 1d4af00
Author: Stephen Roller <roller@fb.com>
Date: Fri Dec 30 14:11:47 2022 -0500

    Fix config.yml dump in training runs. (facebookresearch#557)

commit ed85aad
Author: Christian Clauss <cclauss@me.com>
Date: Fri Dec 30 07:43:41 2022 +0100

    Current flake8 no longer accepts comments on config lines (facebookresearch#570)
    * Current flake8 no longer accepts comments on config lines
      `ValueError: Error code '#' supplied to 'extend-ignore' option does not match '^[A-Z]{1,3}[0-9]{0,3}$'`
    * flake8==6.0.0
    * Update .flake8
    * Update setup.py
    Co-authored-by: Stephen Roller <roller@fb.com>

commit db6842b
Author: Taichi Nishimura <lack_un@yahoo.co.jp>
Date: Fri Dec 30 12:14:49 2022 +0900

    Add backslash to the script in projects/OPT/download_opt175b.md (facebookresearch#573)
    * add backslash to script
    * add backslash to docs/api.md

commit 966561e
Author: Binh Tang <tangbinh.na@gmail.com>
Date: Wed Dec 28 13:11:39 2022 -0800

    Add a new script to reshard model parallel parts (facebookresearch#556)
    Co-authored-by: Binh Tang <tangbinhna@gmail.com>
Summary of Changes
The existing script for resharding model parallel parts (i.e. `metaseq/scripts/reshard_model_parallel.py`) loads all checkpoint parts at once, which can cause OOM errors when RAM is limited, especially for very large models. Here, we rewrite the script to reduce memory usage by first allocating an unsharded model state dict and then iteratively merging model parallel parts into it.

Previously, peak memory usage was close to 2x the model size because we needed to hold both the input and output state dicts. In theory it is now closer to 1x the model size thanks to the iterative process.
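The iterative merge described above can be illustrated with a minimal sketch. This is not the script from this PR: it uses nested Python lists in place of tensors so that it runs without PyTorch, and `allocate_unsharded`, `merge_part`, `merge_iteratively`, `load_part`, and the `split_dim` mapping are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]
StateDict = Dict[str, Matrix]


def allocate_unsharded(shapes: Dict[str, Tuple[int, int]]) -> StateDict:
    """Preallocate the full (unsharded) weights, filled with zeros."""
    return {
        name: [[0.0] * cols for _ in range(rows)]
        for name, (rows, cols) in shapes.items()
    }


def merge_part(
    full: StateDict, part: StateDict, rank: int, split_dim: Dict[str, int]
) -> None:
    """Copy one model parallel part into its slot of the preallocated weights."""
    for name, shard in part.items():
        if split_dim[name] == 0:
            # Sharded along rows: this rank owns a contiguous block of rows.
            offset = rank * len(shard)
            for i, row in enumerate(shard):
                full[name][offset + i][:] = row
        else:
            # Sharded along columns: this rank owns a block of columns.
            width = len(shard[0])
            offset = rank * width
            for i, row in enumerate(shard):
                full[name][i][offset:offset + width] = row


def merge_iteratively(
    shapes: Dict[str, Tuple[int, int]],
    split_dim: Dict[str, int],
    load_part: Callable[[int], StateDict],
    num_parts: int,
) -> StateDict:
    """Hold the output plus one input part at a time, not all parts at once."""
    full = allocate_unsharded(shapes)
    for rank in range(num_parts):
        part = load_part(rank)  # assumed to read a single checkpoint part
        merge_part(full, part, rank, split_dim)
        del part  # drop the reference before loading the next part
    return full


# Toy example with two parts; real checkpoints would be loaded from disk.
parts = [
    {"fc1.weight": [[1.0, 2.0], [3.0, 4.0]], "fc2.weight": [[1.0, 2.0], [5.0, 6.0]]},
    {"fc1.weight": [[5.0, 6.0], [7.0, 8.0]], "fc2.weight": [[3.0, 4.0], [7.0, 8.0]]},
]
shapes = {"fc1.weight": (4, 2), "fc2.weight": (2, 4)}
split_dim = {"fc1.weight": 0, "fc2.weight": 1}
merged = merge_iteratively(shapes, split_dim, lambda rank: parts[rank], 2)
print(merged["fc1.weight"])
print(merged["fc2.weight"])
```

With this structure, peak memory is roughly the output state dict plus a single input part, which is where the "closer to 1x" estimate comes from. Which parameters are split along which dimension is an assumption of the sketch and depends on the model architecture.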
The new script produces the same output as `metaseq/scripts/reshard_model_parallel.py`. We delete it to avoid duplication and note that the old script still remains accessible in the internal repo (see this script).

Test Plan
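We compare the new script against `metaseq/scripts/reshard_model_parallel.py` while resharding an OPT-175B checkpoint from 8 MP parts into 16 MP parts. The old script takes 849.40 seconds and results in a peak RSS delta of 668,301 MB, while the new script takes 891.65 seconds and has a peak RSS delta of 458,185 MB. That is a 31% reduction in RAM usage (equivalently, the old script uses about 46% more RAM than the new one), at the cost of roughly 5% more wall-clock time.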
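The relative figures can be rechecked from the four raw measurements reported in the test plan. The snippet below is plain arithmetic on those numbers; the variable names are chosen for illustration only.

```python
# Raw measurements from the test plan.
old_seconds, new_seconds = 849.40, 891.65
old_rss_mb, new_rss_mb = 668_301, 458_185

# Reduction relative to the old script's peak RSS delta.
rss_reduction = (old_rss_mb - new_rss_mb) / old_rss_mb
# How much more RAM the old script uses relative to the new one.
old_over_new = old_rss_mb / new_rss_mb - 1
# Extra wall-clock time of the new script relative to the old one.
slowdown = new_seconds / old_seconds - 1

print(f"RSS reduction vs. old script: {rss_reduction:.1%}")
print(f"Old script uses {old_over_new:.1%} more RAM than the new one")
print(f"New script is {slowdown:.1%} slower")
```

The 46% figure corresponds to the old script's usage relative to the new one; measured against the old script's usage, the reduction is about 31%.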