
[Bugfix] Fix for sharding TP only#32

Merged
comaniac merged 7 commits into awslabs:main from zarzen:zhen/tp-fix
Feb 1, 2023

Conversation


@zarzen zarzen commented Jan 31, 2023

Description

  • When TP is enabled and PP is disabled, the last linear layer is sharded, so we need ParallelCrossEntropy as the loss function instead of the loss computed by HF.
  • Using TP with the ZeRO-1 optimizer in the DeepSpeed runtime requires passing mpu to the deepspeed.initialize(...) call.
  • When --disable_pipeline is used to disable pipeline parallelism, we need to get num_mp from the args.
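The first bullet above is the key point: with tensor parallelism, each rank holds only a slice of the output-layer logits along the vocabulary dimension, so the softmax normalizer in the cross-entropy loss must be combined across ranks rather than computed locally. The sketch below illustrates that idea in plain Python; the all-reduces are simulated with loops over shards, and the function and variable names are illustrative assumptions, not the actual ParallelCrossEntropy API.

```python
import math

def parallel_cross_entropy(logit_shards, target, vocab_shard_size):
    """Cross-entropy over vocabulary-sharded logits (illustrative sketch).

    Each element of `logit_shards` stands for the logit slice held by one
    tensor-parallel rank; rank r owns vocabulary ids
    [r * vocab_shard_size, (r + 1) * vocab_shard_size).
    """
    # Global max for numerical stability (an all-reduce MAX in practice).
    global_max = max(max(shard) for shard in logit_shards)

    # Softmax denominator summed across all shards (an all-reduce SUM
    # in practice) -- this is the step a purely local HF loss would miss.
    sum_exp = sum(math.exp(x - global_max)
                  for shard in logit_shards for x in shard)

    # Only the rank owning the target id holds the target logit.
    rank, offset = divmod(target, vocab_shard_size)
    target_logit = logit_shards[rank][offset]

    # loss = logsumexp(all logits) - target logit
    return math.log(sum_exp) - (target_logit - global_max)
```

Computing the loss per-shard without these cross-rank reductions would normalize over only a fraction of the vocabulary, which is why the sharded last layer needs a parallel-aware loss.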

Known issue

  • With the fix, the loss converges to about 7.x after 200 steps, which is still slower than running ZeRO-1 or ZeRO-3 alone.
  • So far this has only been tested with the GPT model.

Checklist

  • PR's title starts with a category (e.g. [Bugfix], [Model], [Tutorial], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Zhen Zhang and others added 2 commits January 31, 2023 21:53
Co-authored-by: Cody Yu <comaniac0422@gmail.com>

@comaniac comaniac left a comment


Otherwise LGTM.

@comaniac comaniac merged commit cfae589 into awslabs:main Feb 1, 2023

comaniac commented Feb 1, 2023

Thanks @zarzen


comaniac commented Feb 1, 2023

Found that this PR breaks the 3D-parallelism scripts (which are not covered in CI yet). @zarzen, please file a fix.

