
[Benchmark] Fix ZeRO-3 step log#31

Merged
zarzen merged 3 commits into awslabs:main from comaniac:fix_zero3_log
Jan 31, 2023

Conversation

@comaniac
Contributor

Description

Fix the benchmark utility train_with_torch to account for the micro batch size when printing the log. It now accepts an optional micro_batch_size and prints the loss only once per global batch.
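A minimal sketch of the behavior described above: when gradient accumulation splits a global batch into micro batches, the loss is accumulated and logged once per global batch. The function and parameter names `train_with_torch` and `micro_batch_size` come from the PR description; the loop body and the `model_step` callback are assumptions for illustration, not the actual benchmark code.

```python
def train_with_torch(model_step, num_steps, global_batch_size, micro_batch_size=None):
    # Hypothetical sketch: model_step() runs one micro batch and returns its loss.
    micro = micro_batch_size or global_batch_size
    grad_accum = global_batch_size // micro  # micro steps per global batch
    accum_loss, global_step = 0.0, 0
    for step in range(num_steps):
        accum_loss += model_step()
        # Only log once a full global batch has been processed.
        if (step + 1) % grad_accum == 0:
            global_step += 1
            print(f"step {global_step}: loss {accum_loss / grad_accum:.4f}")
            accum_loss = 0.0
```

With `micro_batch_size=None` (or equal to the global batch size) every step logs, preserving the old behavior; with a smaller micro batch size, intermediate micro-batch losses are averaged into one line per global batch.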

Checklist

  • PR's title starts with a category (e.g. [Bugfix], [Model], [Tutorial], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

cc @zarzen

@comaniac
Contributor Author

Per offline discussion, we now have two functions to deal with different cases:

  1. train_with_deepspeed_engine: Train with ZeRO but without pipeline parallelism. This function uses the DeepSpeed engine's .global_steps attribute to print the loss, so we don't need to worry about the batch size or DP size.
  2. train_with_torch: Train with the PyTorch native runtime. In this case we assume no DP, no PP, and no gradient accumulation, so each micro batch is also the global batch. This is currently only used by WideResNet with TP.

In addition, this PR reduces the number of training steps in CI to shorten the CI time.

@comaniac comaniac mentioned this pull request Jan 31, 2023
@zarzen zarzen merged commit 1b1ef74 into awslabs:main Jan 31, 2023
@comaniac comaniac deleted the fix_zero3_log branch February 3, 2023 23:25