Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Re-interpret error in lost worker scenario #6193

Merged
merged 1 commit into from
Apr 26, 2022

Conversation

mrocklin
Copy link
Member

Previously when a worker would die unexpectedly
(such as with a sys.exit call)
the nanny's multiprocessing.Queue to that worker would get a strange
TypeError because the process went away.
We now identify and interpret this error message and tell the nanny to clean up accordingly.

Previously when a worker would die unexpectedly
(such as with a sys.exit call)
the nanny's multiprocessing.Queue to that worker would get a strange
TypeError because the process went away.
We now identify and interpret this error message and tell the nanny to
clean up accordingly.
@mrocklin
Copy link
Member Author

Merging in 24 hours if there are no comments

loop.add_callback(do_stop, **msg)
try:
msg = child_stop_q.get()
except (TypeError, OSError):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have an example traceback of the TypeError. I'm a bit uncomfortable as the docs indicate that this shouldn't be thrown at all.

get([block[, timeout]])
Remove and return an item from the queue. If optional args block is True (the default) and timeout is None (the default), block if necessary until an item is available. If timeout is a positive number, it blocks at most timeout seconds and raises the queue.Empty exception if no item was available within that time. Otherwise (block is False), return an item if one is immediately available, else raise the queue.Empty exception (timeout is ignored in that case).
Changed in version 3.8: If the queue is closed, ValueError is raised instead of OSError.

Additionally, it's probably worth catching ValueError due to the exception change in Python 3.8.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Grrrr, this stackoverflow comment was deleted. https://stackoverflow.com/questions/66886004/typeerror-when-getting-content-from-queue

The google cache of the above mentions something like the following:

  File "listener.py", line 77, in <module>
    print(parse_queue())
  File "listener.py", line 68, in parse_queue
    for item in Drainer(queue_res):
  File "listener.py", line 18, in __iter__
    yield self.q.get(False, 0.6)
  File "<string>", line 2, in get
  File "/usr/lib/python3.6/multiprocessing/managers.py", line 756, in _callmethod
    conn.send((self._id, methodname, args, kwds))
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 404, in _send_bytes
    self._send(header + buf)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
TypeError: an integer is required (got type NoneType)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is the situation you're running into if processes are dying during sys.exit

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, FWIW, I wouldn't spend too much time digging in here. This is a very very small part of the codebase and not a critical issue.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was getting OSError in Python 3.10. I haven't seen ValueError in practice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I think I see. For an OSError, I guess there's distinction between:

  1. a formal queue close and
  2. the IPC socket closing as reality collapses around the Python interpreter due to a sys.exit?

Just FYI, I was thinking of the first case, which has now been replaced by ValueError https://github.com/python/cpython/pull/9010/files.

@github-actions
Copy link
Contributor

Unit Test Results

       16 files  +       4         16 suites  +4   7h 17m 3s ⏱️ + 2h 23m 3s
  2 728 tests +       3    2 646 ✔️ +     11       81 💤  -     9  1 +1 
21 709 runs  +5 403  20 670 ✔️ +5 142  1 038 💤 +260  1 +1 

For more details on these failures, see this check.

Results for commit 98c94ff. ± Comparison against base commit 6ad89e9.

@mrocklin mrocklin merged commit fb49514 into dask:main Apr 26, 2022
@mrocklin mrocklin deleted the nanny-child-stop-q branch April 26, 2022 13:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants