@T1mn
Contributor

@T1mn T1mn commented Jan 10, 2026

Handle empty buffers in compressed allreduce by early-return and clearing error buffers to avoid NaNs and needless communication.

@T1mn T1mn requested a review from GuanhuaWang as a code owner January 10, 2026 13:14
When the buffer has numel==0, scaling divides by sqrt(0) and the
pack/unpack path becomes undefined. Early-return in the compressed
allreduce backends and clear error buffers to avoid NaNs and useless
communication.

Test plan:
- Not run (torch not available in this environment)
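The failure mode described above can be sketched without torch. This is a minimal illustration, assuming the scale is computed as the L2 norm divided by the square root of the element count; `one_bit_scale` and `safe_one_bit_compress` are hypothetical names, not DeepSpeed functions. Plain Python floats raise `ZeroDivisionError` for 0/0, whereas floating-point tensor division would yield NaN and propagate silently.

```python
import math
from typing import List, Tuple


def one_bit_scale(values: List[float]) -> float:
    """Scale factor: L2 norm divided by sqrt(number of elements)."""
    norm = math.sqrt(sum(v * v for v in values))
    # For an empty list this is 0.0 / 0.0, which raises in pure Python.
    return norm / math.sqrt(len(values))


def safe_one_bit_compress(values: List[float]) -> Tuple[float, List[int]]:
    """Return (scale, signs), short-circuiting when there is nothing to compress."""
    if not values:
        # Early return: no scale to compute and nothing to communicate.
        return 0.0, []
    scale = one_bit_scale(values)
    signs = [1 if v >= 0 else -1 for v in values]
    return scale, signs
```

The guard sits before any arithmetic, so the empty case never reaches the division.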

Signed-off-by: T1mn <136770748@qq.com>
@T1mn T1mn force-pushed the fix/comm-empty-compressed-allreduce branch from 8cfe2c8 to 84be831 Compare January 10, 2026 13:25
@tohtana
Collaborator

tohtana commented Jan 11, 2026

Hi @T1mn, thank you for your fix! Since we need the same code in four files, can you create deepspeed/runtime/comm/utils.py and define the common logic there to avoid duplicating it? You could have something like:

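    from typing import Optional

    import torch


    def check_and_handle_empty_buffer(
        buffer_m: torch.Tensor,
        original_shape: torch.Size,
        original_size: int,
        worker_error: torch.Tensor,
        server_error: torch.Tensor,
    ) -> Optional[torch.Tensor]:
        # Returns the buffer to hand back when it is empty, or None to continue.
        if original_size == 0:
            # Clear any stale error-feedback state in place.
            if worker_error.numel():
                worker_error.zero_()
            if server_error.numel():
                server_error.zero_()
            # Restore the caller's shape if the buffer was flattened.
            if len(original_shape) > 1:
                return buffer_m.reshape(original_shape)
            return buffer_m
        return None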

Then each backend can call it like this:

    result = check_and_handle_empty_buffer(
        buffer_m, original_shape, original_size, worker_error, server_error
    )
    if result is not None:
        return result
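To see how the suggested helper and call site fit together, here is a self-contained sketch. The helper body follows the suggestion above; `compressed_allreduce_sketch` is a hypothetical stand-in for a backend method, and its non-empty branch is only a placeholder (scale times sign), not the actual compression or communication that the DeepSpeed backends perform.

```python
from typing import Optional

import torch


def check_and_handle_empty_buffer(
    buffer_m: torch.Tensor,
    original_shape: torch.Size,
    original_size: int,
    worker_error: torch.Tensor,
    server_error: torch.Tensor,
) -> Optional[torch.Tensor]:
    """Return the buffer for the empty case, or None to continue normally."""
    if original_size == 0:
        if worker_error.numel():
            worker_error.zero_()
        if server_error.numel():
            server_error.zero_()
        if len(original_shape) > 1:
            return buffer_m.reshape(original_shape)
        return buffer_m
    return None


def compressed_allreduce_sketch(
    buffer_m: torch.Tensor,
    worker_error: torch.Tensor,
    server_error: torch.Tensor,
) -> torch.Tensor:
    """Hypothetical backend entry point showing where the helper is called."""
    original_shape = buffer_m.size()
    if len(original_shape) > 1:
        buffer_m = torch.flatten(buffer_m)
    original_size = buffer_m.numel()

    result = check_and_handle_empty_buffer(
        buffer_m, original_shape, original_size, worker_error, server_error
    )
    if result is not None:
        return result  # empty buffer: skip scaling and communication

    # Placeholder for the real compress/communicate path (assumption).
    scale = torch.linalg.norm(buffer_m) / (original_size ** 0.5)
    return (scale * buffer_m.sign()).reshape(original_shape)
```

Returning `None` for the non-empty case keeps the call site to a two-line check, which is what lets the four backends share the same guard.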

@T1mn T1mn force-pushed the fix/comm-empty-compressed-allreduce branch from 725f0ed to 94a093d Compare January 12, 2026 02:53
@T1mn
Contributor Author

T1mn commented Jan 12, 2026

Hi @tohtana, thanks for the detailed suggestion. I’ve followed your guidance and extracted the empty-buffer handling into deepspeed/runtime/comm/utils.py as check_and_handle_empty_buffer(...). The four backends now call this helper and return early when it produces a result, so the logic is centralized and consistent across files.
Please let me know if you’d like the helper placed elsewhere or want any additional adjustments.

@tohtana
Collaborator

tohtana commented Jan 12, 2026

Thank you @T1mn! This looks good. Can you fix the formatting?

@T1mn T1mn force-pushed the fix/comm-empty-compressed-allreduce branch from 94a093d to a9bcbb3 Compare January 12, 2026 06:44
Factor the empty-buffer early return into a small helper so the four backends stay consistent.

Signed-off-by: T1mn <136770748@qq.com>
@T1mn T1mn force-pushed the fix/comm-empty-compressed-allreduce branch from a9bcbb3 to 2c6e42c Compare January 12, 2026 07:02
@T1mn
Contributor Author

T1mn commented Jan 12, 2026

I’ve run formatting and pushed the updated commit. Please take another look when you have a chance.

@T1mn
Contributor Author

T1mn commented Jan 15, 2026

Hi @tohtana, I may have missed some formatting details earlier. I ran the formatter and pushed the update.
If you spot anything else to tweak, please let me know. Appreciate your time!

@tohtana tohtana enabled auto-merge (squash) January 15, 2026 07:50
@sfc-gh-truwase sfc-gh-truwase requested review from tohtana and removed request for GuanhuaWang January 15, 2026 15:48
@tohtana tohtana merged commit 97e7430 into deepspeedai:master Jan 15, 2026
12 of 13 checks passed
phalani-paladugu pushed a commit to phalani-paladugu/DeepSpeed that referenced this pull request Jan 29, 2026
Handle empty buffers in compressed allreduce by early-return and
clearing error buffers to avoid NaNs and needless communication.

---------

Signed-off-by: T1mn <136770748@qq.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>