What happened:
On some compute platforms, workers can start and connect to the scheduler before their own inbound networking is set up. This creates a race condition: a worker can register with the scheduler, but the scheduler and other workers cannot yet connect to that worker to send it work, which results in errors.
One concrete example is using Readiness Probes on Kubernetes. If probes are configured, Kubernetes will not allow network connections into a pod until the probes have passed N times. As a result, inbound networking for a worker may not be enabled until a couple of seconds after it begins listening on its TCP address. During those seconds the worker will register with the scheduler, and the scheduler may try to send it work within a few milliseconds.
What you expected to happen:
When the scheduler receives a new worker registration, it should check that it can connect back to the worker before putting that worker into service.
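To make the expected behaviour concrete, here is a rough sketch of the kind of check I have in mind. It is not existing scheduler code: `wait_until_reachable` is a hypothetical helper name, the timeout and retry interval are arbitrary, and it only uses plain `asyncio` to test whether a TCP connection to the worker's listening address succeeds. A real implementation would presumably go through the scheduler's own comm layer instead.

```python
import asyncio


async def wait_until_reachable(host, port, timeout=5.0, interval=0.1):
    """Hypothetical helper: return True once a TCP connection to
    (host, port) succeeds, or False if `timeout` seconds pass first."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            return False
        try:
            # Try to open a plain TCP connection back to the worker.
            reader, writer = await asyncio.wait_for(
                asyncio.open_connection(host, port), timeout=remaining
            )
        except (OSError, asyncio.TimeoutError):
            # Not reachable yet (refused, unroutable, or timed out):
            # wait a little and retry until the deadline.
            await asyncio.sleep(min(interval, max(deadline - loop.time(), 0)))
            continue
        # Connection succeeded, so inbound networking is available.
        writer.close()
        await writer.wait_closed()
        return True
```

The idea would be that the scheduler only marks the worker as available for tasks once a check like this returns `True`, and otherwise rejects or delays the registration.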
Minimal Complete Verifiable Example:
This is a little hard to create an MCVE for.
Anything else we need to know?:
This bug was highlighted by @stephan-erb-by and @philipp-sontag-by when discussing the new operator in dask/dask-kubernetes#256.