Conversation

@GuanhuaWang GuanhuaWang commented Apr 16, 2024

Before: the overflow check was scattered and duplicated in many places.

This PR:

  • A single interface, the CheckOverflow class, which abstracts and unifies overflow checking across ZeRO, ZeRO-Offload, Pipeline Parallelism, and BF16_Optimizer.
  • Skip the step() operation if gradient overflow is detected in BF16_Optimizer (avoids polluting checkpoints, etc.).

cc @tjruwase

Anhelor commented Apr 20, 2024

Why not use tensor.isnan() and tensor.isinf()?

'''Checks for overflow in gradient across parallel process'''

-    def __init__(self, param_groups=None, mpu=None, zero_reduce_scatter=False, deepspeed=None):
+    def __init__(self, param_groups=None, mpu=None, zero_reduce_scatter=False, deepspeed=None, partition_grads=False):

Passing the deepspeed engine into a submodule is not a good design and creates all sorts of cyclic-reference issues. It is better to pass the specific attributes that are needed, such as enable_backward_allreduce.

timers = self.timers if self.wall_clock_breakdown() else NoopTimer()
optimizer = BF16_Optimizer(optimizer,
                           self.param_names,
                           deepspeed=self,

Don't pass self into submodules.

self._setup_for_real_optimizer()

# Overflow check init
self.overflow = False

Should self.overflow be a class member, given that it seems to be used only once?
