Add keep-alive message between worker and scheduler #2907

Merged
mrocklin merged 5 commits into dask:master from mrocklin:scheduler-worker-keep-alive on Aug 2, 2019

@mrocklin (Member) commented Jul 29, 2019

This is effectively a heartbeat, but much simpler and less frequent than
our current heartbeats.

Fixes #2524
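
To illustrate the idea, here is a minimal sketch of a periodic keep-alive loop. It is not the code from this PR: the `keep_alive` function, the `send` callable, and the `{"op": "keep-alive"}` message shape are all assumptions made for the example.

```python
import asyncio


async def keep_alive(send, interval, stop):
    """Send a tiny message every `interval` seconds until `stop` is set.

    `send` is a hypothetical coroutine that writes one message to the
    open connection; the message shape below is an assumption.
    """
    while not stop.is_set():
        await send({"op": "keep-alive"})
        try:
            # Sleep for `interval`, but wake up early if asked to stop.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass


async def demo():
    sent = []

    async def send(msg):
        # Stand-in for a real stream write.
        sent.append(msg)

    stop = asyncio.Event()
    task = asyncio.create_task(keep_alive(send, 0.01, stop))
    await asyncio.sleep(0.05)
    stop.set()
    await task
    return sent


sent = asyncio.run(demo())
```

The message carries no payload on purpose: its only job is to put some traffic on the connection so that intermediaries do not treat it as idle.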

@lr4d commented Aug 1, 2019

Getting this error on the scheduler side:

2019-08-01 13:30:48,297 ERROR    <lambda>() got an unexpected keyword argument 'worker' (distributed.core)
Traceback (most recent call last):
  File "/mnt/mesos/sandbox/venv/lib/python3.6/site-packages/distributed/core.py", line 477, in handle_stream
    handler(**merge(extra, msg))
TypeError: <lambda>() got an unexpected keyword argument 'worker'
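
The traceback suggests that the stream handler is called with extra keyword arguments (here `worker`) merged into the message, and that the registered handler was a zero-argument lambda. The sketch below reproduces that failure in isolation and shows one plausible way to avoid it. The local `merge` helper, the `extra` dict, and both handlers are stand-ins written for this example; the actual fix in the PR is not shown in this thread.

```python
def merge(*dicts):
    """Combine dicts left to right (stand-in for the merge call in the traceback)."""
    out = {}
    for d in dicts:
        out.update(d)
    return out


extra = {"worker": "tcp://127.0.0.1:1234"}  # hypothetical per-connection context
msg = {}  # the keep-alive message carries no payload

strict_handler = lambda: None  # rejects every keyword argument
tolerant_handler = lambda **kwargs: None  # accepts and ignores them

try:
    strict_handler(**merge(extra, msg))
    error = None
except TypeError as exc:
    error = str(exc)

# The tolerant handler is a no-op regardless of what context is passed in.
result = tolerant_handler(**merge(extra, msg))
```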

@mrocklin (Member, Author) commented Aug 1, 2019

Thanks @lr4d. Handled.

Also, what is a good interval for sending this message? Every minute? Every ten minutes? Every hour?

@lr4d commented Aug 1, 2019

I'd keep it below 5-10 minutes. For HAProxy, the threshold for killing a connection on which no data is sent is 60 minutes, but I don't know what it is for similar tools.
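
As a rough illustration of that reasoning, the interval can be derived from the shortest idle timeout along the path, with a cap so it stays in the 5-10 minute range. The function name, the divisor of 10, and the 600 second cap are assumptions chosen for the example, not values from this PR.

```python
def keep_alive_interval(idle_timeout_s, divisor=10, cap_s=600.0):
    """Pick a keep-alive interval comfortably below an idle timeout.

    `divisor` and `cap_s` are illustrative defaults, not library settings.
    """
    if idle_timeout_s <= 0:
        raise ValueError("idle_timeout_s must be positive")
    return min(idle_timeout_s / divisor, cap_s)


haproxy_timeout_s = 60 * 60  # the 60 minute threshold mentioned above
interval_s = keep_alive_interval(haproxy_timeout_s)
```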

@lr4d commented Aug 2, 2019

This appears to be working fine now for our HAProxy setup. I left the cluster alive for 6 hours and saw no disconnections or error messages.
Thanks @mrocklin

@mrocklin merged commit 4dc3d19 into dask:master Aug 2, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
continuous-integration/travis-ci/pr The Travis CI build passed

@mrocklin deleted the mrocklin:scheduler-worker-keep-alive branch Aug 2, 2019
