@un-def un-def commented Mar 3, 2025

If any GPU's utilization is below the threshold in every sample within the time window, kill the job (and, consequently, the run).

Closes: #2374
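
The rule in the description above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function name, its signature, and the handling of a GPU with no samples are assumptions made for the example.

```python
from typing import Iterable, Sequence


def should_terminate_due_to_low_gpu_util(
    min_util: float, gpus_util: Iterable[Sequence[float]]
) -> bool:
    """Return True if any GPU stayed below min_util in every sample of the window.

    Hypothetical helper: gpus_util holds one sequence of utilization samples
    (in percent) per GPU, all taken from the same time window.
    """
    for samples in gpus_util:
        # Assumption: a GPU with no samples in the window is not treated as idle.
        if samples and all(sample < min_util for sample in samples):
            return True
    return False


# One busy GPU and one GPU that never reaches the 10% threshold.
print(should_terminate_due_to_low_gpu_util(10, [[50, 60], [0, 5]]))
```

Note that a single sample at or above the threshold is enough to keep a GPU from triggering termination, so short bursts of activity reset the check for that window.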

@un-def un-def requested a review from r4victor March 3, 2025 15:37
if _should_terminate_due_to_low_gpu_util(
    policy.min_gpu_utilization, [m.values for m in gpus_util_metrics]
):
    logger.debug("%s: GPU utilization check: terminating", fmt(job_model))
I'd log this at info level.

@un-def un-def merged commit 6e438c4 into master Mar 4, 2025
24 checks passed
@un-def un-def deleted the issue_2374_utilization_policy branch March 4, 2025 07:13
Successfully merging this pull request may close these issues.

[Feature]: Job termination based on GPU utilization
