Add support for workers that are referenced by host names rather than by IP addresses #2590
| return old | ||
| @toolz.memoize |
This slightly concerns me. I've seen systems where DNS lookups are surprisingly expensive. Your approach here is probably safer in the long run, though, especially for longer-running processes.
DNS caching also occurs at the OS level.
I'm not familiar with this decorator, but I do not see any mention of cache flushing to remove stale entries. OS DNS caches usually have a TTL (time to live) that specifies when entries need to be re-queried.
With the current implementation, it appears that @toolz.memoize will never flush the cache.
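To make the TTL point concrete, here is a minimal sketch of a memoizing decorator whose entries expire. It is not what this PR does and not part of toolz; `ttl_memoize`, `resolve_host`, and the 30-second TTL are hypothetical names and values chosen for illustration, and the clock is injectable only so the behaviour can be checked without waiting.

```python
import socket
import time


def ttl_memoize(ttl_seconds, clock=time.monotonic):
    """Cache results of a one-argument function for at most ttl_seconds.

    Hypothetical helper for illustration; toolz.memoize has no expiry.
    """

    def decorator(func):
        cache = {}  # key -> (expiry_time, value)

        def wrapper(key):
            now = clock()
            hit = cache.get(key)
            if hit is not None and hit[0] > now:
                # Entry is still fresh: reuse it.
                return hit[1]
            # Missing or stale entry: look it up again and reset the expiry.
            value = func(key)
            cache[key] = (now + ttl_seconds, value)
            return value

        return wrapper

    return decorator


@ttl_memoize(ttl_seconds=30)  # assumed TTL, pick whatever suits the network
def resolve_host(hostname):
    # socket.gethostbyname is the standard-library IPv4 lookup.
    return socket.gethostbyname(hostname)
```

With something like this, a stale address would survive for at most the TTL, at the cost of one extra lookup per hostname per TTL window.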
|
I have a preference for this approach. While #2593 is succinct, I don't think dask should be responsible for lookups and caching; it should instead rely on external DNS, as this PR does. Additionally, as @darindf says, DNS caching is a thing, and it can be tuned, since most systems expose these kinds of DNS configuration options. Still getting a few errors in the implementation: |
|
I tend to agree at this point in time, as the alternate implementation relies on the scheduler to evict the dead worker, and I'm not clear whether that is configurable. The dead worker may never be evicted, or may be evicted only after a long delay, i.e. by relying on the scheduler heartbeat mechanism. |
|
Hrm, it looks like a bunch of other commits got pulled in here. |
|
Not sure why you are getting failing tests. Running locally worked for me. I would suggest fixing conflicts and updating the PR. Happy to work on this today and tomorrow if you have time @darindf |
|
Just got the note. Tomorrow is a holiday for me, so I won't be able to touch it until Monday. Not sure why I got conflicts.
|
|
Not sure why I got conflicts
We recently merged a large code style change. My apologies about this.
Hopefully the merge conflicts should be mostly superficial.
|
…dresses may change after network connection is established, such as being suspended and resumed.
70deac1 to
48941df
Compare
|
@quasiben I believe this is ready. I have verified the code several times. I'm not sure why lint is failing, nor why one of the tests is failing; the test seems completely unrelated to the changes I made. |
|
Looking now |
|
Not sure why the test is failing, but the linting failure is due to black finding an issue with the code style. Recently, dask switched to using black (#2614). If you do the following it should clear everything up:
pip install black; black distributed --check
If you remove the --check, black will apply the necessary changes. |
|
There are some intermittent testing failures today
|
| finally: | ||
| self.heartbeat_active = False | ||
| else: | ||
| logger.debug("Heartbeat skipped: channel busy") |
This method checks a variable to see whether another thread is currently performing a heartbeat; if so, it logs the "channel busy" message and skips the heartbeat.
There is a race condition: two threads may both evaluate the if branch and both conclude that a heartbeat is necessary. I found this because there were multiple concurrent heartbeat messages on the scheduler from the same worker.
Ideally, registering with the scheduler should be forced to be a single-threaded operation, with appropriate checks to prevent concurrent execution as well as checks to verify that the registration is still necessary.
I found that the periodic heartbeat could call register-with-scheduler in multiple threads, and that the handler for a broken network connection would also call it, so you could end up with multiple threads concurrently trying to register with the scheduler.
To top it off, if one thread successfully registered with the scheduler, the other thread would fail and throw an exception. Because the periodic heartbeat is disabled at the start of register-with-scheduler, the thrown exception would prevent the heartbeat from being re-enabled.
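A rough sketch of the "single registration at a time, and re-check whether it is still needed" idea follows. It assumes coroutine-based concurrency with asyncio; a `threading.Lock` would play the same role for real threads. `WorkerSketch`, `ensure_registered`, and `_do_register` are hypothetical names for illustration, not the actual Worker code in this PR.

```python
import asyncio


class WorkerSketch:
    """Hypothetical stand-in for a worker; not distributed's Worker class."""

    def __init__(self):
        self._register_lock = asyncio.Lock()
        self.registered = False
        self.register_calls = 0

    async def _do_register(self):
        # Placeholder for the real network round trip to the scheduler.
        self.register_calls += 1
        await asyncio.sleep(0)
        self.registered = True

    async def ensure_registered(self):
        """Register at most once, even when called concurrently."""
        async with self._register_lock:
            # Re-check inside the lock: another caller may have finished
            # registering while this one was waiting.
            if self.registered:
                return False
            await self._do_register()
            return True


async def demo():
    worker = WorkerSketch()
    # Simulate the heartbeat and the connection-lost handler racing.
    results = await asyncio.gather(
        *(worker.ensure_registered() for _ in range(3))
    )
    return worker.register_calls, sorted(results)
```

Running `asyncio.run(demo())` performs one registration; the other two callers find the worker already registered and return without raising, so nothing is left in a half-disabled state.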
This patch makes changes so that a worker whose contact-address and listener-addresses use a DNS hostname, rather than an IP address, can handle situations where the IP address of the host changes. The IP address may change because the host gets a new DHCP lease with a different IP address, or because the machine is suspended and then resumed.
This means that DNS names cannot be cached against IP addresses: hostnames always need to be resolved against the DNS server, as the answer may change.
For a short period of time, the worker's old IP address may still be listed in the scheduler. Eventually the scheduler will recognize that the entry is stale and expunge it (the worker does not die, so nothing forces the removal), but in the meantime the worker may attempt to register with the scheduler. Prior to the fix, the worker was refused as a duplicate worker and was then never able to rejoin the scheduler.
Note that the scheduler identifies workers by their IP endpoint; the aliases seem to be another caching mechanism that helps map worker names to their respective endpoint IP addresses.
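To illustrate resolving the hostname on every use rather than caching it, here is a minimal sketch built on the standard library's `socket.getaddrinfo`. The helper name `resolve_endpoint` and the injectable `resolver` parameter are assumptions made for illustration; this is not the code in the patch.

```python
import socket


def resolve_endpoint(hostname, port, resolver=socket.getaddrinfo):
    """Resolve hostname afresh and return (ip, port) for the first result.

    Hypothetical helper: nothing is cached here, so a changed DNS answer
    is picked up on the next call.
    """
    infos = resolver(hostname, port, type=socket.SOCK_STREAM)
    if not infos:
        raise OSError("no addresses found for %r" % hostname)
    # Each entry is (family, type, proto, canonname, sockaddr).
    family, socktype, proto, canonname, sockaddr = infos[0]
    return sockaddr[0], sockaddr[1]
```

Calling this before each connection attempt means a worker that comes back from suspend with a new DHCP lease is reached at its new address, subject to whatever caching the OS resolver itself applies.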