
[Bugfix] Fix for sharding TP only#32

Merged
comaniac merged 7 commits into awslabs:main from zarzen:zhen/tp-fix
Feb 1, 2023

Conversation


@zarzen zarzen commented Jan 31, 2023

Description

  • When TP is enabled and PP is disabled, the last linear layer is sharded, so we need ParallelCrossEntropy as the loss function instead of the loss computed by HF.
  • Using TP with the ZeRO-1 optimizer in the DeepSpeed runtime requires passing mpu to the deepspeed.initialize(...) call.
  • When --disable_pipeline is used to disable pipeline parallelism, we need to get num_mp from the args.
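The first bullet above is the key point: with tensor parallelism, each rank holds only a slice of the output-layer logits along the vocabulary dimension, so the softmax normalizer in the cross-entropy loss must be combined across ranks rather than computed locally. The sketch below illustrates that idea in plain Python; the all-reduces are simulated with loops over shards, and the function and variable names are illustrative assumptions, not the actual ParallelCrossEntropy API.

```python
import math

def parallel_cross_entropy(logit_shards, target, vocab_shard_size):
    """Cross-entropy over vocabulary-sharded logits (illustrative sketch).

    Each element of `logit_shards` stands for the logit slice held by one
    tensor-parallel rank; rank r owns vocabulary ids
    [r * vocab_shard_size, (r + 1) * vocab_shard_size).
    """
    # Global max for numerical stability (an all-reduce MAX in practice).
    global_max = max(max(shard) for shard in logit_shards)

    # Softmax denominator summed across all shards (an all-reduce SUM
    # in practice) -- this is the step a purely local HF loss would miss.
    sum_exp = sum(math.exp(x - global_max)
                  for shard in logit_shards for x in shard)

    # Only the rank owning the target id holds the target logit.
    rank, offset = divmod(target, vocab_shard_size)
    target_logit = logit_shards[rank][offset]

    # loss = logsumexp(all logits) - target logit
    return math.log(sum_exp) - (target_logit - global_max)
```

Computing the loss per-shard without these cross-rank reductions would normalize over only a fraction of the vocabulary, which is why the sharded last layer needs a parallel-aware loss.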

Known issue

  • With the fix, the loss converges to about 7.x after 200 steps, which is still slower than running ZeRO-1 or ZeRO-3 alone.
  • So far this has only been tested with the GPT model.

Checklist

  • PR's title starts with a category (e.g. [Bugfix], [Model], [Tutorial], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Zhen Zhang and others added 2 commits January 31, 2023 21:53
Co-authored-by: Cody Yu <comaniac0422@gmail.com>

@comaniac comaniac left a comment


Otherwise LGTM.

@comaniac comaniac merged commit cfae589 into awslabs:main Feb 1, 2023

comaniac commented Feb 1, 2023

Thanks @zarzen


comaniac commented Feb 1, 2023

Found that this PR breaks the 3D-parallelism scripts (which are not covered in CI yet). @zarzen, please file a fix.

