Alternate implementation to support workers that are referenced by host names rather than by ip addresses #2593

Open
darindf wants to merge 2 commits into dask:main from darindf:take2


darindf commented Apr 2, 2019

As the title says, this is an alternate implementation of pull request #2590.

This variation has some subtle changes. The scheduler's aliases are checked before the worker is registered, so that if the check fails, the worker remains unregistered with the scheduler. Otherwise, later attempts would yield "worker is already registered" when in fact the worker never fully registered.
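The ordering described above can be illustrated with a minimal sketch. `ToyScheduler`, its `workers` and `aliases` dictionaries, and the error messages are hypothetical stand-ins for illustration, not the actual scheduler code in this pull request. The point is only that every check runs before any state is mutated, so a failed registration leaves nothing behind.

```python
class ToyScheduler:
    """Hypothetical stand-in for a scheduler; not the distributed API."""

    def __init__(self):
        self.workers = {}  # address -> name
        self.aliases = {}  # name -> address

    def add_worker(self, address, name):
        # Validate first: nothing is mutated until every check has passed.
        if name in self.aliases:
            raise ValueError(f"name taken: {name}")
        if address in self.workers:
            raise ValueError(f"worker already registered: {address}")
        # Only now record the worker in both mappings.
        self.workers[address] = name
        self.aliases[name] = address


scheduler = ToyScheduler()
scheduler.add_worker("tcp://10.0.0.1:4000", "worker-a")

try:
    # Same name from a different address: rejected by the alias check.
    scheduler.add_worker("tcp://10.0.0.2:4000", "worker-a")
    rejected = False
except ValueError:
    rejected = True
```

Because the alias check comes first, the rejected address never appears in `workers`, so a later retry is not mistaken for a duplicate registration.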

Another change is to not use aliases for address resolution (coerce_address), since the worker would otherwise always be resolved to the IP address of its first registration.

To support this, a worker's attempt to register with the scheduler must tolerate a failed registration so that it can be retried. Currently, a scheduler failure during worker registration permanently prevents the worker from registering with the scheduler. In my testing, hostname lookup via ensure_ip failed because DNS had not propagated yet, so the worker became a zombie: still running, but no longer registering with the scheduler.
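As a rough illustration of retrying registration after a transient lookup failure, here is a minimal sketch. The helper `register_with_retry`, the `flaky_register` callable, and the hostname are all hypothetical; the real worker registration code is asynchronous and more involved. The only library fact relied on is that `socket.gaierror`, the error raised on a failed name lookup, is a subclass of `OSError`.

```python
import socket
import time


def register_with_retry(register, attempts=5, delay=0.0):
    """Call register() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return register()
        except OSError as exc:  # socket.gaierror (DNS failure) subclasses OSError
            last_error = exc
            time.sleep(delay)
    # Every attempt failed: surface the most recent error.
    raise last_error


calls = {"count": 0}


def flaky_register():
    """Hypothetical registration that fails until DNS has 'propagated'."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise socket.gaierror("name not known yet: worker-a.example")
    return "registered"


result = register_with_retry(flaky_register, attempts=5, delay=0.0)
```

Here the first two attempts fail with a lookup error and the third succeeds, so the worker ends up registered instead of giving up after the first failure.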

  • the worker's IP address changes while the worker continues running
  • the worker detects this immediately, the existing TCP connection is dropped, and the worker tries to register with the scheduler over a new TCP connection
  • the scheduler refuses to register this worker because it is still defined, i.e. the worker name is contained in aliases
  • these 2 steps may repeat several times: the worker attempts registration, the scheduler refuses it
  • at some point, the scheduler detects that the existing worker, at the old IP address, is no longer reachable, and evicts it from the scheduler (removing it from workers and aliases)
  • the worker attempts to register with the scheduler again, and this time the scheduler allows registration because the worker "name" is no longer present
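The sequence above can be walked through with a small simulation. Everything here is a hypothetical sketch: the `try_register` and `evict` helpers, the module-level `workers` and `aliases` dictionaries, and the addresses are made up for illustration and do not reproduce the scheduler's actual code.

```python
workers = {}  # address -> name
aliases = {}  # name -> address


def try_register(address, name):
    """Return True if registered, False if the name is still held."""
    if name in aliases:
        return False
    workers[address] = name
    aliases[name] = address
    return True


def evict(address):
    """Remove an unreachable worker from both workers and aliases."""
    name = workers.pop(address)
    del aliases[name]


old_address = "tcp://10.0.0.1:4000"
new_address = "tcp://10.0.0.2:4000"

log = []
log.append(try_register(old_address, "worker-a"))  # initial registration
# The worker's IP changes; it reconnects from the new address.
log.append(try_register(new_address, "worker-a"))  # refused: name still in aliases
log.append(try_register(new_address, "worker-a"))  # refused again
evict(old_address)  # scheduler notices the old address is unreachable
log.append(try_register(new_address, "worker-a"))  # accepted
```

After the eviction, the name maps to the new address only, which matches the final bullet: the worker is accepted once its old entry is gone.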

… may resolve to different ip addresses during the worker lifespan.

mrocklin commented Apr 3, 2019

@quasiben if you have a moment, can I ask you to take a look at this and share your opinion?

Base automatically changed from master to main March 8, 2021 19:03
@darindf darindf requested a review from fjetter as a code owner January 23, 2024 10:57