
Restart workers when worker-ttl expires #8538

Merged: 2 commits merged into dask:main from worker_ttl on Apr 15, 2024

Conversation

crusaderky (Collaborator) commented Feb 27, 2024

```diff
@@ -545,7 +545,7 @@ def __init__(
         self._memory_unmanaged_old = 0
         self._memory_unmanaged_history = deque()
         self.metrics = {}
-        self.last_seen = 0
+        self.last_seen = time()
```
crusaderky (Collaborator, Author) commented on the diff:
Align behaviour to ClientState
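
For context, a minimal sketch of why starting `last_seen` at the current time matters for the worker-ttl check. The names `WorkerStateSketch` and `ttl_expired` are illustrative, not actual distributed code; only the `self.last_seen = time()` line comes from the diff, and the 300 s default mirrors the timeout discussed later in this thread:

```python
from time import time

WORKER_TTL = 300.0  # seconds; illustrative default matching the 300 s discussed below


class WorkerStateSketch:
    """Toy stand-in for the scheduler-side WorkerState, for illustration only."""

    def __init__(self) -> None:
        # Before this change the attribute started at 0 (the Unix epoch), so a
        # brand-new worker that had not yet heartbeated looked like it had been
        # silent for decades. Starting at "now" aligns it with ClientState.
        self.last_seen = time()

    def heartbeat(self) -> None:
        # Each heartbeat refreshes the timestamp.
        self.last_seen = time()


def ttl_expired(ws: WorkerStateSketch, worker_ttl: float = WORKER_TTL) -> bool:
    """True if the worker has not been heard from within the TTL."""
    return time() - ws.last_seen > worker_ttl
```

With `last_seen = 0`, `ttl_expired` would report True for a fresh worker that had not yet sent its first heartbeat; seeding it with the current time avoids that false positive.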

crusaderky force-pushed the worker_ttl branch 2 times, most recently from b1aaf4a to 7c20a11 on February 27, 2024 at 12:54

github-actions bot (Contributor) commented Feb 27, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

    29 files  ±0      29 suites  ±0    11h 19m 2s ⏱️ +2m 34s
 4 065 tests  +3     3 952 ✅ +4      109 💤 ±0   3 ❌ -2   1 🔥 +1
55 027 runs  +43    52 613 ✅ +45   2 410 💤 +1   3 ❌ -4   1 🔥 +1

For more details on these failures and errors, see this check.

Results for commit 2cee264. Comparison against base commit 6be418d.

♻️ This comment has been updated with latest results.

crusaderky force-pushed the worker_ttl branch 4 times, most recently from e24f463 to d1f6231 on March 4, 2024 at 18:33
crusaderky changed the title from "Try restarting workers when worker-ttl expires" to "Restart workers when worker-ttl expires" on Mar 4, 2024
crusaderky marked this pull request as ready for review on March 4, 2024 at 18:37
crusaderky requested a review from fjetter as a code owner on March 4, 2024 at 18:37
crusaderky marked this pull request as a draft on March 5, 2024 at 13:53
crusaderky self-assigned this on Mar 5, 2024
crusaderky marked this pull request as ready for review on March 5, 2024 at 14:06
crusaderky marked this pull request as a draft on March 5, 2024 at 14:44
crusaderky marked this pull request as ready for review on March 5, 2024 at 15:41
crusaderky force-pushed the worker_ttl branch 2 times, most recently from a4ce685 to b213688 on March 5, 2024 at 16:21
phofl (Collaborator) commented Mar 7, 2024

Will this restart the worker immediately, or only once whatever is running releases the GIL again / finishes?

What I want to know: if my program holds the GIL and the 300 s timeout therefore hits, do we restart the worker before my program finishes?

crusaderky (Collaborator, Author) replied:

> Will this restart the worker immediately, or only once whatever is running releases the GIL again / finishes?
>
> What I want to know: if my program holds the GIL and the 300 s timeout therefore hits, do we restart the worker before my program finishes?

Immediately. The nanny will first try to close the worker gracefully by sending it an {op: close} message; if there is no answer within 5 seconds, it will SIGKILL the worker process.
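
As a schematic of that close-then-kill behaviour (a sketch only: `send_close` and `process` are placeholder handles, not real distributed APIs):

```python
import asyncio
import signal

GRACE_PERIOD = 5.0  # seconds, matching the grace period described above


async def close_or_kill(send_close, process) -> None:
    """Schematic of the close-then-kill pattern described above.

    ``send_close`` is assumed to be a coroutine that delivers a message to the
    worker over its comm, and ``process`` an OS process handle; both are
    placeholders, not real distributed APIs.
    """
    try:
        # Ask the worker to shut down gracefully ...
        await asyncio.wait_for(send_close({"op": "close"}), timeout=GRACE_PERIOD)
    except asyncio.TimeoutError:
        # ... and if it does not answer in time (e.g. a task is holding the
        # GIL), terminate the worker process outright.
        process.send_signal(signal.SIGKILL)
```

The 5-second grace period mirrors the figure quoted above.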

crusaderky force-pushed the worker_ttl branch 4 times, most recently from 4fc3483 to 4a0373f on March 12, 2024 at 12:59
```python
await self.remove_worker(address=ws.address, stimulus_id=stimulus_id)

if to_restart:
    await self.restart_workers(
```
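
Since the excerpt above is cut off mid-call, here is a hedged reconstruction of what the surrounding TTL sweep might look like; the nanny/no-nanny split and the `restart_workers` arguments are assumptions for illustration, not the verbatim scheduler code:

```python
from time import time


async def check_worker_ttl(scheduler, worker_ttl: float = 300.0) -> None:
    """Hedged reconstruction of the TTL sweep around the excerpt above.

    ``scheduler`` is duck-typed and the ``restart_workers`` call signature is
    an assumption, since the diff excerpt is truncated mid-call.
    """
    now = time()
    stimulus_id = f"check-worker-ttl-{now}"
    to_restart = []

    for ws in list(scheduler.workers.values()):
        if now - ws.last_seen > worker_ttl:
            if ws.nanny:
                # A nanny can restart the worker process in place.
                to_restart.append(ws.address)
            else:
                # Without a nanny there is nothing to restart, so remove it.
                await scheduler.remove_worker(
                    address=ws.address, stimulus_id=stimulus_id
                )

    if to_restart:
        await scheduler.restart_workers(to_restart, stimulus_id=stimulus_id)
```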
A reviewer (Collaborator) commented on this change:

Could you briefly check that the tracking in https://github.com/coiled/platform/blob/4dbd6f449884464caaba09b470aa06394a22d024/analytics/preload_scripts/telemetry.py#L772 still works? I don't see a reason why not, but want to be sure

crusaderky (Collaborator, Author) replied:
Afraid I can no longer do that. Is there anything else I can do to push this PR over the finish line?

A reviewer (Member) replied:
It should still work from what I can tell. Closing workers is a bit of a brittle thing, though, so it's not impossible that there is some kind of race condition where that check would no longer hold. If that's the case, we can look into it later.

fjetter (Member) commented Apr 15, 2024

I merged main. Assuming CI is not horribly broken, we can merge.

crusaderky merged commit 587d472 into dask:main on Apr 15, 2024
29 of 34 checks passed
crusaderky deleted the worker_ttl branch on April 15, 2024 at 17:02

Successfully merging this pull request may close these issues.

worker-ttl timeout should attempt a nanny restart
3 participants