Check worker connectivity before sending work #5934

@jacobtomlinson

What happened:

On some compute platforms, workers can start and connect to the scheduler before their own inbound networking is set up. This creates a race condition: a worker can register with the scheduler, but the scheduler and other workers cannot connect back to that worker to send work, which results in errors.

One concrete example is Readiness Probes on Kubernetes. If probes are configured, Kubernetes will not allow network connections into a pod until the probes have passed N times. This may mean that inbound networking to a worker is not enabled until a couple of seconds after it begins listening on its TCP address. During those seconds the worker will register with the scheduler, and the scheduler may try to send it work within a few milliseconds.

What you expected to happen:

When the scheduler receives a new worker registration, it should check that it can connect back to the worker before putting that worker into service.
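As a rough illustration of the kind of check described above, here is a minimal sketch using only Python's standard `asyncio` library. It is not Dask's implementation or API: the function names `can_connect` and `wait_until_reachable`, and the retry parameters, are hypothetical and chosen only for this example. It assumes the worker's listening address is a plain TCP host and port.

```python
import asyncio


async def can_connect(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds.

    Hypothetical helper for illustration; not part of Dask.
    """
    try:
        _, writer = await asyncio.wait_for(
            asyncio.open_connection(host, port), timeout
        )
    except (OSError, asyncio.TimeoutError):
        # Refused, unreachable, or timed out: treat as "not reachable yet".
        return False
    writer.close()
    try:
        await writer.wait_closed()
    except OSError:
        # The peer may reset the connection while we close; the probe still succeeded.
        pass
    return True


async def wait_until_reachable(
    host: str,
    port: int,
    attempts: int = 5,
    delay: float = 0.5,
    timeout: float = 1.0,
) -> bool:
    """Probe the address up to `attempts` times, sleeping `delay` seconds between tries.

    Returns True as soon as one probe succeeds, False if all of them fail.
    """
    for attempt in range(attempts):
        if await can_connect(host, port, timeout):
            return True
        if attempt < attempts - 1:
            await asyncio.sleep(delay)
    return False
```

In this sketch, a scheduler-side registration handler would call something like `await wait_until_reachable(worker_host, worker_port)` and only mark the worker as available for work if it returns `True`. Whether to reject the worker or keep retrying after a failure is a design choice this example leaves open.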

Minimal Complete Verifiable Example:

This is a little hard to create an MCVE for.

Anything else we need to know?:

This bug was highlighted by @stephan-erb-by and @philipp-sontag-by when discussing the new operator in dask/dask-kubernetes#256.


Labels: bug
