
Add support for workers that are referenced by host names rather than by ip addresses#2590

Open
darindf wants to merge 4 commits into dask:main from darindf:master

Conversation

@darindf (Contributor) commented Apr 1, 2019

This patch changes the worker so that a contact-address and listener-address configured with a DNS hostname, rather than an IP address, can handle situations where the host's IP address changes. The IP address may change because the host gets a new DHCP lease, or because the machine is suspended and then resumed.

This means that DNS names cannot be cached as IP addresses: hostnames always need to be resolved against the DNS server, as the answer may change.

For a short period the worker's old IP address may still be listed in the scheduler, but eventually the scheduler will recognize that fact and expunge it (since the worker doesn't die to force the removal); in the meantime the worker may attempt to register with the scheduler. Prior to this fix, the worker was refused as a duplicate worker and was then never able to rejoin the scheduler.

Note that the scheduler identifies workers by their IP endpoint; the aliases appear to be another caching mechanism to help map worker names to their respective endpoint IP addresses.
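The core idea can be sketched as follows. This is a minimal illustration, not the PR's actual diff; the helper name `resolve_contact_address` is hypothetical:

```python
import socket

def resolve_contact_address(host, port):
    """Resolve a hostname-based contact address to its current endpoint.

    The hostname is re-resolved on every call so that a new DHCP lease or
    a suspend/resume cycle yields the host's *current* IP address.  No
    memoization is applied: a cached answer could point at a stale address.
    """
    ip = socket.gethostbyname(host)  # fresh DNS/hosts lookup each time
    return "%s:%d" % (ip, port)

print(resolve_contact_address("localhost", 8786))
```

The point of the sketch is the absence of caching: each registration attempt resolves the configured hostname anew, so a worker whose address changed can still reach (and be reached at) the right endpoint.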

Comment thread distributed/utils.py
return old


@toolz.memoize
Member
This slightly concerns me. I've seen systems where DNS lookups are surprisingly expensive. Your approach here is probably safer in the long run, though, especially for longer-running processes.

Contributor Author

DNS caching also occurs at the OS level.

I'm not familiar with this decorator, but I do not see any mention of cache flushing to remove stale entries; OS DNS caches usually have a TTL (time to live) to specify when entries need to be re-queried.

With the current implementation, it appears that @toolz.memoize will never flush the cache.

@mrocklin (Member) commented Apr 3, 2019

@quasiben if you have a moment can I ask you to take a look at this and share your opinion? Please also see #2593

@quasiben (Member) commented Apr 3, 2019

I have a preference for this approach. While #2593 is succinct, I don't think dask should be responsible for lookups and caching; it should rely on external DNS as this PR does. Additionally, as @darindf says, DNS caching is a thing -- and it's something that can be tuned, as most DNS setups expose these kinds of configuration options.

Still getting a few errors in the implementation:
https://ci.appveyor.com/project/daskdev/distributed/builds/23521915#L1991

@mrocklin (Member) commented Apr 8, 2019

@darindf any response to @quasiben's comment above? I notice that you've continued pushing on the other implementation instead.

@darindf (Contributor, Author) commented Apr 8, 2019

I tend to agree at this point, as the alternate implementation relies on the scheduler to evict the dead worker, and I'm not clear whether that is configurable; relying on the scheduler heartbeat mechanism, the dead worker may never be evicted, or may be evicted only after a long delay.

@mrocklin (Member)

Hrm, it looks like a bunch of other commits got pulled in here.

@quasiben (Member)

Not sure why you are getting failing tests; running locally worked for me. I would suggest fixing the conflicts and updating the PR. I'm happy to work on this today and tomorrow if you have time, @darindf.

@darindf (Contributor, Author) commented Apr 18, 2019 via email

@mrocklin (Member) commented Apr 18, 2019 via email

darindf added 3 commits April 22, 2019 08:52
…dresses may change after network connection is established, such as being suspended and resumed.
@darindf darindf force-pushed the master branch 9 times, most recently from 70deac1 to 48941df Compare April 23, 2019 16:02
@darindf (Contributor, Author) commented Apr 23, 2019

@quasiben I believe this is ready. I have verified the code several times; I'm not sure why lint is failing, nor why one of the tests fails, since it seems completely unrelated to the changes I made.

@quasiben (Member)

Looking now

@quasiben (Member)

Not sure why the test is failing, but the linting failure is due to black finding an issue with the code style. Recently, dask switched to using black. If you run the following it should clear everything up:

pip install black; black distributed --check

If you remove the --check, black will apply the necessary changes.

@mrocklin (Member) commented Apr 23, 2019 via email

Comment thread distributed/worker.py
finally:
self.heartbeat_active = False
else:
logger.debug("Heartbeat skipped: channel busy")
Member

Why this change?

Contributor Author

I was finding that while this method checks a variable to see whether another thread is currently performing a heartbeat, and if so issues the "channel busy" message and skips the heartbeat check,

there is a race condition: two threads may both have passed the if branch and found that a heartbeat check is necessary. This was discovered because the scheduler was receiving multiple concurrent heartbeat messages from the same worker.

Ideally, register-with-scheduler should be a single-threaded resource, with appropriate guards to prevent concurrent operation as well as checks to verify that the registration is still necessary.

I found that not only could the periodic heartbeat call register-with-scheduler from multiple threads, but if the network connection broke, that handler would also call register-with-scheduler, so you could end up with multiple threads concurrently trying to register with the scheduler.

Then, to top it off, if one thread successfully registered with the scheduler, the other thread would fail and throw an exception, which would disable the heartbeat's re-registration: register-with-scheduler disables the periodic heartbeat at its start, and the thrown exception would prevent it from being re-enabled.
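The race described above is the classic check-then-act gap: testing a busy flag and then setting it as two separate steps. A sketch of closing that gap with a non-blocking lock (using a hypothetical `Registrar` class, not worker.py's actual fields):

```python
import threading

class Registrar:
    """Sketch: make 'check busy, then register' a single atomic step."""

    def __init__(self):
        self._lock = threading.Lock()
        self.registrations = 0

    def heartbeat(self):
        # acquire(blocking=False) atomically tests *and* sets the busy
        # state, so two threads can never both pass the "not busy" check
        # the way they can with a plain boolean flag.
        if not self._lock.acquire(blocking=False):
            return False  # another thread is already registering: skip
        try:
            self.registrations += 1  # stand-in for register_with_scheduler()
            return True
        finally:
            # releasing in `finally` means an exception during registration
            # cannot leave the worker permanently unable to re-register
            self._lock.release()
```

The `finally` block addresses the second failure mode described above: even if registration throws, the busy state is cleared, so the periodic heartbeat can try again later.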

Base automatically changed from master to main March 8, 2021 19:03
@darindf darindf requested a review from fjetter as a code owner January 23, 2024 10:57

3 participants