
Add support for workers that are referenced by host names rather than by ip addresses#2590

Open
darindf wants to merge 4 commits into dask:main from darindf:master

Conversation

@darindf (Contributor) commented Apr 1, 2019

This patch changes the worker so that a contact-address and listener-address configured with a DNS hostname, rather than an IP address, can handle situations where the host's IP address changes. The IP address may change because the host gets a new DHCP lease, or because the machine is suspended and then resumed.

This means that DNS names cannot be cached as IP addresses: hostnames always need to be resolved against the DNS server, as the answer may change.

For a short period the worker's old IP address may still be listed in the scheduler, but eventually the scheduler will recognize that fact and expunge it (since the worker doesn't die to force the removal); in the meantime the worker may attempt to register with the scheduler. Prior to this fix, the worker was refused as a duplicate worker and was then never able to rejoin the scheduler.

Note that the scheduler identifies workers by their IP endpoint; the aliases appear to be another caching mechanism to help map worker names to their respective endpoint IP addresses.
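The core idea can be sketched as follows. This is a minimal illustration, not the PR's actual diff; the helper name `resolve_contact_address` is hypothetical:

```python
import socket

def resolve_contact_address(host, port):
    """Resolve a hostname-based contact address to its current endpoint.

    The hostname is re-resolved on every call so that a new DHCP lease or
    a suspend/resume cycle yields the host's *current* IP address.  No
    memoization is applied: a cached answer could point at a stale address.
    """
    ip = socket.gethostbyname(host)  # fresh DNS/hosts lookup each time
    return "%s:%d" % (ip, port)

print(resolve_contact_address("localhost", 8786))
```

The point of the sketch is the absence of caching: each registration attempt resolves the configured hostname anew, so a worker whose address changed can still reach (and be reached at) the right endpoint.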

Comment thread distributed/utils.py
return old


@toolz.memoize
Member
This slightly concerns me. I've seen systems where DNS lookups are surprisingly expensive. Your approach here is probably safer in the long run, though, especially for longer-running processes.

Contributor Author

DNS caching also occurs at the OS level.

I'm not familiar with this decorator, but I do not see any mention of cache flushing to remove stale entries; OS DNS caches usually have a TTL (time to live) to specify when entries need to be re-queried.

With the current implementation, it appears that @toolz.memoize will never flush the cache.

@mrocklin (Member) commented Apr 3, 2019

@quasiben if you have a moment can I ask you to take a look at this and share your opinion? Please also see #2593

@quasiben (Member) commented Apr 3, 2019

I have a preference for this approach. While #2593 is succinct, I don't think dask should be responsible for lookups and caching; it should rely on external DNS as this PR does. Additionally, as @darindf says, DNS caching is a thing -- and it's something that can be tuned, as most DNS setups expose these kinds of configuration options.

Still getting a few errors in the implementation:
https://ci.appveyor.com/project/daskdev/distributed/builds/23521915#L1991

@mrocklin (Member) commented Apr 8, 2019

@darindf any response to @quasiben's comment above? I notice that you've continued pushing on the other implementation instead.

@darindf (Contributor, Author) commented Apr 8, 2019

I tend to agree at this point, as the alternate implementation relies on the scheduler to evict the dead worker, and I'm not clear whether that is configurable; relying on the scheduler heartbeat mechanism, the dead worker may never be evicted, or may be evicted only after a long delay.

@mrocklin (Member)

Hrm, it looks like a bunch of other commits got pulled in here.

@quasiben (Member)

Not sure why you are getting failing tests; running locally worked for me. I would suggest fixing the conflicts and updating the PR. I'm happy to work on this today and tomorrow if you have time, @darindf.

@darindf (Contributor, Author) commented Apr 18, 2019 via email

@mrocklin (Member) commented Apr 18, 2019 via email

darindf added 3 commits April 22, 2019 08:52
…dresses may change after network connection is established, such as being suspended and resumed.
@darindf darindf force-pushed the master branch 9 times, most recently from 70deac1 to 48941df Compare April 23, 2019 16:02
@darindf (Contributor, Author) commented Apr 23, 2019

@quasiben I believe this is ready. I have verified the code several times; I'm not sure why lint is failing, nor why one of the tests fails, since it seems completely unrelated to the changes I made.

@quasiben (Member)

Looking now

@quasiben (Member)

Not sure why the test is failing, but the linting failure is due to black finding an issue with the code style. Recently, dask switched to using black. If you run the following it should clear everything up:

pip install black; black distributed --check

If you remove the --check, black will apply the necessary changes.

@mrocklin (Member) commented Apr 23, 2019 via email

Comment thread distributed/worker.py
finally:
self.heartbeat_active = False
else:
logger.debug("Heartbeat skipped: channel busy")
Member

Why this change?

Contributor Author

I was finding that while this method checks a variable to see whether another thread is currently performing a heartbeat, and if so issues the "channel busy" message and skips the heartbeat check,

there is a race condition: two threads may both have passed the if branch and found that a heartbeat check is necessary. This was discovered because the scheduler was receiving multiple concurrent heartbeat messages from the same worker.

Ideally, register-with-scheduler should be a single-threaded resource, with appropriate guards to prevent concurrent operation as well as checks to verify that the registration is still necessary.

I found that not only could the periodic heartbeat call register-with-scheduler from multiple threads, but if the network connection broke, that handler would also call register-with-scheduler, so you could end up with multiple threads concurrently trying to register with the scheduler.

Then, to top it off, if one thread successfully registered with the scheduler, the other thread would fail and throw an exception, which would disable the heartbeat's re-registration: register-with-scheduler disables the periodic heartbeat at its start, and the thrown exception would prevent it from being re-enabled.
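The race described above is the classic check-then-act gap: testing a busy flag and then setting it as two separate steps. A sketch of closing that gap with a non-blocking lock (using a hypothetical `Registrar` class, not worker.py's actual fields):

```python
import threading

class Registrar:
    """Sketch: make 'check busy, then register' a single atomic step."""

    def __init__(self):
        self._lock = threading.Lock()
        self.registrations = 0

    def heartbeat(self):
        # acquire(blocking=False) atomically tests *and* sets the busy
        # state, so two threads can never both pass the "not busy" check
        # the way they can with a plain boolean flag.
        if not self._lock.acquire(blocking=False):
            return False  # another thread is already registering: skip
        try:
            self.registrations += 1  # stand-in for register_with_scheduler()
            return True
        finally:
            # releasing in `finally` means an exception during registration
            # cannot leave the worker permanently unable to re-register
            self._lock.release()
```

The `finally` block addresses the second failure mode described above: even if registration throws, the busy state is cleared, so the periodic heartbeat can try again later.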

Base automatically changed from master to main March 8, 2021 19:03
@darindf darindf requested a review from fjetter as a code owner January 23, 2024 10:57

3 participants