Dev global rank #313

Merged
wiederm merged 6 commits into main from dev-global-rank
Nov 7, 2024

wiederm commented Nov 7, 2024

Pull Request Summary

This PR fixes a regression introduced in PR #308, which restricted logging of the epoch time to the main process (rank == 0). That restriction caused a synchronization error that froze training after the first epoch. This PR keeps the logging on the main process only, without disrupting the training flow.
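The PR does not spell out the exact cause of the freeze. One common way such a freeze arises in distributed training is that a collective operation is reached by only a subset of ranks, so the rest of the group never completes it. The sketch below simulates that situation with standard-library threads standing in for processes. `WORLD_SIZE`, `run_epoch`, and `simulate` are hypothetical names used for illustration only, and the timeout exists solely so the example terminates instead of hanging.

```python
import threading

WORLD_SIZE = 2  # hypothetical number of ranks, simulated here as threads


def run_epoch(rank: int, barrier: threading.Barrier, guard_on_rank: bool, results: dict) -> None:
    """Simulate the end of one epoch on a single rank."""
    try:
        if guard_on_rank:
            # Problematic pattern: only rank 0 enters the collective operation.
            if rank == 0:
                barrier.wait(timeout=0.5)
        else:
            # Safe pattern: every rank enters the collective operation.
            barrier.wait(timeout=0.5)
        results[rank] = "ok"
    except threading.BrokenBarrierError:
        # A real distributed job would hang here instead of timing out.
        results[rank] = "stalled"


def simulate(guard_on_rank: bool) -> dict:
    """Run one simulated epoch on all ranks and collect each rank's outcome."""
    barrier = threading.Barrier(WORLD_SIZE)
    results: dict = {}
    threads = [
        threading.Thread(target=run_epoch, args=(rank, barrier, guard_on_rank, results))
        for rank in range(WORLD_SIZE)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Calling `simulate(True)` leaves rank 0 waiting for a partner that never arrives, while `simulate(False)` lets both ranks pass the barrier.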

Key changes

  • Adjusted Logging Conditions: Added rank_zero_only=True to the logging calls that should run only on the main process, so that restricting them no longer causes synchronization issues across processes.
  • Removed Redundant Logging Checks: Replaced explicit if self.global_rank == 0 checks around certain metrics with rank_zero_only passed directly to the self.log() calls.
  • Ensured Sync Consistency: Logging operations that require synchronization across processes (e.g., gradients, learning rate) use sync_dist=True for safe distributed handling, while process-specific logging keeps rank_zero_only=True.
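A minimal sketch of the refactoring described in the list above, assuming `module` is a Lightning module whose `log()` method accepts the `rank_zero_only` and `sync_dist` keyword arguments. The helper functions, the `SupportsLog` protocol, and the metric names `train/epoch_time` and `train/lr` are hypothetical and are not taken from the PR diff.

```python
from typing import Any, Protocol


class SupportsLog(Protocol):
    """Anything exposing a Lightning-style ``log`` method (assumed interface)."""

    def log(self, name: str, value: Any, **kwargs: Any) -> None: ...


def log_epoch_time(module: SupportsLog, seconds: float) -> None:
    # Before (the pattern the PR removes):
    #     if module.global_rank == 0:
    #         module.log("train/epoch_time", seconds)
    # After (as described in the PR): no explicit rank guard;
    # rank_zero_only is passed to log() instead.
    module.log("train/epoch_time", seconds, rank_zero_only=True)


def log_learning_rate(module: SupportsLog, lr: float) -> None:
    # Values that should be synchronized across processes are logged
    # with sync_dist=True, following the PR description.
    module.log("train/lr", lr, sync_dist=True)
```

Keeping the rank handling inside the `log()` call, rather than in an `if` around it, means the calling code takes the same path on every process.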

Associated Issue(s)

Pull Request Checklist

  • Issue(s) raised/addressed and linked
  • Includes appropriate unit test(s)
  • Appropriate docstring(s) added/updated
  • Appropriate .rst doc file(s) added/updated
  • PR is ready for review

wiederm self-assigned this Nov 7, 2024
wiederm linked an issue Nov 7, 2024 that may be closed by this pull request

codecov-commenter commented Nov 7, 2024

Codecov Report

Attention: Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.

Project coverage is 85.48%. Comparing base (c1430f2) to head (8b23f18).

wiederm merged commit fa4814f into main Nov 7, 2024
wiederm deleted the dev-global-rank branch November 7, 2024 14:40

Development

Successfully merging this pull request may close these issues.

Synchronization Error with Epoch Time Logging
