Conversation

@GuanhuaWang GuanhuaWang commented Apr 16, 2024

Before: the overflow check was scattered and duplicated in many places.

This PR:

  • A single interface, the CheckOverflow class, which abstracts and unifies overflow checking across ZeRO, ZeRO-Offload, Pipeline Parallelism, and BF16_Optimizer.
  • Skip the step() operation if gradient overflow is detected in BF16_Optimizer (avoids polluting checkpoints, etc.).

cc @tjruwase

Anhelor commented Apr 20, 2024

Why not use tensor.isnan() and tensor.isinf()?

'''Checks for overflow in gradient across parallel process'''

-    def __init__(self, param_groups=None, mpu=None, zero_reduce_scatter=False, deepspeed=None):
+    def __init__(self, param_groups=None, mpu=None, zero_reduce_scatter=False, deepspeed=None, partition_grads=False):

Passing the deepspeed engine into a submodule is not a good design and creates all sorts of cyclic-reference issues. It is better to pass the specific attributes that are needed, such as enable_backward_allreduce.

timers = self.timers if self.wall_clock_breakdown() else NoopTimer()
optimizer = BF16_Optimizer(optimizer,
                           self.param_names,
                           deepspeed=self,

Don't pass self into submodules.

self._setup_for_real_optimizer()

# Overflow check init
self.overflow = False

Should self.overflow be a class member, given that it seems to be used only once?
