Add support for workers that are referenced by host names rather than by ip addresses #2590
```diff
@@ -24,7 +24,7 @@
 from tornado.gen import Return
 from tornado import gen
 from tornado.ioloop import IOLoop
-from tornado.locks import Event
+from tornado.locks import Event, Lock

 from . import profile, comm
 from .batched import BatchedSend
@@ -502,6 +502,7 @@ def __init__(
         self.scheduler_delay = 0
         self.stream_comms = dict()
         self.heartbeat_active = False
+        self.heartbeat_lock = Lock()
         self._ipython_kernel = None

         if self.local_dir not in sys.path:
@@ -711,7 +712,7 @@ def _register_with_scheduler(self):
         except gen.TimeoutError:
             logger.info("Timed out when connecting to scheduler")
         if response["status"] != "OK":
-            raise ValueError("Unexpected response from register: %r" % (response,))
+            logger.warning("Unexpected response from register: %r" % (response,))
         else:
             # Retrieve eventual init functions and run them
             for function_bytes in response["worker-setups"]:
@@ -734,30 +735,31 @@ def _register_with_scheduler(self):

     @gen.coroutine
     def heartbeat(self):
-        if not self.heartbeat_active:
-            self.heartbeat_active = True
-            logger.debug("Heartbeat: %s" % self.address)
-            try:
-                start = time()
-                response = yield self.scheduler.heartbeat_worker(
-                    address=self.contact_address, now=time(), metrics=self.get_metrics()
-                )
-                end = time()
-                middle = (start + end) / 2
-
-                if response["status"] == "missing":
-                    yield self._register_with_scheduler()
-                    return
-                self.scheduler_delay = response["time"] - middle
-                self.periodic_callbacks["heartbeat"].callback_time = (
-                    response["heartbeat-interval"] * 1000
-                )
-            except CommClosedError:
-                logger.warning("Heartbeat to scheduler failed")
-            finally:
-                self.heartbeat_active = False
-        else:
-            logger.debug("Heartbeat skipped: channel busy")
+        with (yield self.heartbeat_lock.acquire()):
+            if not self.heartbeat_active:
+                self.heartbeat_active = True
+                logger.debug("Heartbeat: %s" % self.address)
+                try:
+                    start = time()
+                    response = yield self.scheduler.heartbeat_worker(
+                        address=self.contact_address, now=time(), metrics=self.get_metrics()
+                    )
+                    end = time()
+                    middle = (start + end) / 2
+
+                    if response["status"] == "missing":
+                        yield self._register_with_scheduler()
+                        return
+                    self.scheduler_delay = response["time"] - middle
+                    self.periodic_callbacks["heartbeat"].callback_time = (
+                        response["heartbeat-interval"] * 1000
+                    )
+                except CommClosedError:
+                    logger.warning("Heartbeat to scheduler failed")
+                finally:
+                    self.heartbeat_active = False
+            else:
+                logger.debug("Heartbeat skipped: channel busy")

     @gen.coroutine
     def handle_scheduler(self, comm):
```
|
Member
Why this change?
Contributor (Author)
This method checks a flag to see whether another thread is already performing a heartbeat, and if so logs the "channel busy" message and skips the heartbeat. There is a race condition, though: two threads can both evaluate that `if` branch and both conclude that a heartbeat is needed. I found this because the scheduler was receiving multiple concurrent heartbeat messages from the same worker.

Ideally, registering with the scheduler should be forced to be a single-threaded resource, with appropriate tests to prevent concurrent operation and to verify that re-registration is still necessary. Not only can the periodic heartbeat call `_register_with_scheduler` from multiple threads; if the network connection is broken, that handler also calls `_register_with_scheduler`, so you can end up with multiple threads concurrently trying to register with the scheduler. To top it off, if one thread registers successfully, the other thread fails and throws an exception. Since `_register_with_scheduler` disables the periodic heartbeat at its start, that exception then prevents the heartbeat from being re-enabled and the worker from re-registering.
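As a standalone illustration of the locking pattern the diff adopts (a minimal sketch, not the worker code itself; `guarded_heartbeat`, `heartbeat_lock`, and `heartbeat_active` are local stand-ins invented for the example), acquiring a `tornado.locks.Lock` around the check-and-send sequence means two concurrent calls can no longer both decide that a heartbeat is needed:

```python
from tornado import gen
from tornado.ioloop import IOLoop
from tornado.locks import Lock

heartbeat_lock = Lock()
heartbeat_active = False


@gen.coroutine
def guarded_heartbeat(name):
    global heartbeat_active
    # The flag check and the heartbeat itself run under the lock, so a
    # second caller waits instead of racing past the check.
    with (yield heartbeat_lock.acquire()):
        if not heartbeat_active:
            heartbeat_active = True
            try:
                print("%s: sending heartbeat" % name)
                yield gen.sleep(0.1)  # stand-in for scheduler.heartbeat_worker(...)
            finally:
                heartbeat_active = False
        else:
            print("%s: skipped, heartbeat already in flight" % name)


if __name__ == "__main__":
    # Two concurrent calls: with the lock held, they run one after the other.
    IOLoop.current().run_sync(
        lambda: gen.multi([guarded_heartbeat("a"), guarded_heartbeat("b")])
    )
```

In this sketch the second call simply waits for the lock and then sends its own heartbeat; the point for the bug described above is that the check and the update of the flag can no longer interleave between concurrent callers.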
This slightly concerns me. I've seen systems where DNS lookups are surprisingly expensive. Your approach here is probably safer in the long run though, especially for longer-running processes.

DNS caching also occurs at the OS level.
I'm not familiar with this decorator, but I don't see any mention of cache flushing to remove stale entries; OS DNS caches usually have a TTL (time to live) that specifies when an entry needs to be re-queried.
With the current implementation, it appears that `@toolz.memoize` will never flush the cache.
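To illustrate the concern, here is a minimal sketch (the helper names `resolve_cached` and `resolve_with_ttl` are invented for this example, not code from the PR) contrasting a `toolz.memoize`-cached lookup, which holds its result for the life of the process, with a lookup that re-queries DNS after a TTL:

```python
import socket
import time

import toolz


@toolz.memoize
def resolve_cached(hostname):
    # Cached for the life of the process: if the host's address ever
    # changes, callers keep receiving the stale entry.
    return socket.gethostbyname(hostname)


_ttl_cache = {}  # hostname -> (address, expiry timestamp)


def resolve_with_ttl(hostname, ttl=60):
    # Re-query DNS once the cached entry is older than `ttl` seconds,
    # mimicking the time-to-live behaviour of OS-level resolver caches.
    now = time.time()
    entry = _ttl_cache.get(hostname)
    if entry is None or entry[1] < now:
        address = socket.gethostbyname(hostname)
        _ttl_cache[hostname] = (address, now + ttl)
        return address
    return entry[0]
```

Whether the extra lookups matter in practice depends on how often the worker resolves names; the TTL variant trades occasional re-resolution for protection against stale addresses.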