Add configurable reboot threshold time and skip reboot if status recently updated #13

Merged
hlts2 merged 2 commits into main from fix/node-reboot
Mar 4, 2025

hlts2 (Member) commented Mar 4, 2025

With the current implementation, reboots were being triggered too frequently. In response, a user asked us to apply a time window of "about 40 minutes," and this PR changes the implementation to meet that request.

With this change, when the check loop runs, a node is rebooted only if its status has not been updated within the time window, which defaults to -40 minutes (that is, the last 40 minutes). The window is configurable, which prevents the unnecessary reboots caused by the previous behavior.

The logic for judging a node unhealthy is unchanged and covers two cases:

  • A node enters the NotReady state.
  • The number of available GPUs per node falls below a configured threshold.
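
The two cases above can be sketched as a single predicate. This is a minimal illustration, not the code from this PR: the `NodeInfo` type, the `needsAttention` function, and the threshold value of 8 are hypothetical names and values chosen for the example, and the real watcher presumably reads node state from the Kubernetes API.

```go
package main

import "fmt"

// NodeInfo is a hypothetical, simplified view of a node used only for this sketch.
type NodeInfo struct {
	Name          string
	Ready         bool // false when the node is in the NotReady state
	AvailableGPUs int  // GPUs currently available on the node
}

// needsAttention reports whether a node matches either of the two cases:
// it is NotReady, or its available GPU count is below minGPUs.
func needsAttention(n NodeInfo, minGPUs int) bool {
	if !n.Ready {
		return true
	}
	return n.AvailableGPUs < minGPUs
}

func main() {
	const minGPUs = 8 // assumed threshold for the example

	nodes := []NodeInfo{
		{Name: "node-a", Ready: true, AvailableGPUs: 8},
		{Name: "node-b", Ready: false, AvailableGPUs: 8},
		{Name: "node-c", Ready: true, AvailableGPUs: 6},
	}
	for _, n := range nodes {
		fmt.Printf("%s needsAttention=%t\n", n.Name, needsAttention(n, minGPUs))
	}
}
```

Matching one of these cases only marks a node as a reboot candidate; whether it is actually rebooted depends on the time window.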

When either case is met, the node's status update time is checked. A reboot is triggered only if the invalid status has persisted for the whole window of -40 minutes (adjustable via an environment variable). This change is expected to resolve the frequent reboots.

Signed-off-by: hlts2 <hiroto.funakoshi.hiroto@gmail.com>
hlts2 requested a review from jokestax March 4, 2025 11:21
hlts2 self-assigned this Mar 4, 2025
hlts2 marked this pull request as ready for review March 4, 2025 11:24
hlts2 requested a review from johndietz March 4, 2025 11:27
3 comment threads on pkg/watcher/watcher_test.go
Signed-off-by: hlts2 <hiroto.funakoshi.hiroto@gmail.com>
hlts2 requested a review from jokestax March 4, 2025 13:27
hlts2 (Member, Author) commented Mar 4, 2025

@jokestax Thank you for your review 🙏 I will merge this PR 🚀

cc: @johndietz

hlts2 merged commit ec65045 into main Mar 4, 2025
hlts2 deleted the fix/node-reboot branch March 4, 2025 13:57
2 participants