Join Nanny watch thread #6146

Merged
merged 1 commit into from
Apr 18, 2022

mrocklin
Member

  • Closes #xxxx
  • Tests added / passed
  • Passes pre-commit run --all-files

I hope that this helps to avoid intermittent failures like the following:

```
E               AssertionError: (<Thread(AsyncProcess Dask Worker process (from Nanny) watch process join, started daemon 123145591922688)>, ['  File ...3.8/multiprocessing/popen_fork.py", line 47, in wait
E                 \treturn self.poll(os.WNOHANG if timeout == 0.0 else 0)
E                 ', ...])
E               assert False
```

https://github.com/dask/distributed/runs/6048484591?check_suite_focus=true
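
The assertion above looks like the output of a check that no threads outlive a test. As a rough illustration only, here is a minimal standard-library sketch of that kind of check. `leaked_threads` is a hypothetical helper written for this example, not the function used in distributed's test utilities, and the thread names are made up.

```python
import threading
import time


def leaked_threads(before, grace=1.0):
    """Return threads started since `before` that are still alive after `grace` seconds."""
    deadline = time.monotonic() + grace
    while True:
        extra = [t for t in threading.enumerate() if t not in before and t.is_alive()]
        if not extra or time.monotonic() >= deadline:
            return extra
        time.sleep(0.01)


# Case 1: a watcher thread that is signalled and joined leaves nothing behind.
before = set(threading.enumerate())
stop = threading.Event()
watcher = threading.Thread(target=stop.wait, name="joined watcher", daemon=True)
watcher.start()
stop.set()
watcher.join(timeout=2)
joined_result = leaked_threads(before)

# Case 2: a watcher that is never signalled or joined is reported as leaked.
before = set(threading.enumerate())
never_set = threading.Event()
stray = threading.Thread(target=never_set.wait, name="stray watcher", daemon=True)
stray.start()
leaked_result = leaked_threads(before, grace=0.2)

# Let the stray thread exit so the example itself cleans up.
never_set.set()
stray.join(timeout=2)
```

Joining the watcher thread before the check runs is what moves a test from the second case to the first.
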
@github-actions
Contributor

Unit Test Results

       16 files  ±0         16 suites  ±0   7h 59m 12s ⏱️ + 13m 0s
  2 743 tests ±0    2 663 ✔️ ±0       80 💤 ±0  0 ±0 
21 829 runs  ±0  20 794 ✔️  - 1  1 035 💤 +1  0 ±0 

Results for commit beb1fed. ± Comparison against base commit f0e9f89.

with suppress(ValueError):
    child_stop_q.close()  # probably redundant
child_stop_q.join_thread()
thread.join(timeout=2)
Member

I'm pretty happy that these are introduced from the POV that there should be explicit attempts to clean up resources. Even more so if tornado.IOLoop.run_sync can raise errors other than TimeoutError and KeyboardInterrupt.

I would suggest removal of # probably redundant

Member Author

Eh, they probably are redundant. It's just that occasionally they're not. The failure here is very rare. I'd like to call that out.
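
To make the cleanup pattern under discussion concrete, here is a minimal sketch using only the standard library. It is not the code from `nanny.py`: `run_with_cleanup`, `stop_q`, and the `{"op": "stop"}` message are assumptions chosen for illustration, and the order of operations is simplified so that the watcher thread has received its stop message before the queue is closed.

```python
import multiprocessing
import threading
from contextlib import suppress


def run_with_cleanup():
    stop_q = multiprocessing.Queue()
    seen = []

    def watch():
        # Stand-in for a watcher thread: block until a stop message arrives.
        seen.append(stop_q.get())

    thread = threading.Thread(target=watch, daemon=True)
    thread.start()
    try:
        pass  # the main work would happen here and may raise
    finally:
        # Best effort: put() raises ValueError if the queue is already closed.
        with suppress(ValueError):
            stop_q.put({"op": "stop"})
        # Bounded wait so that a stuck watcher cannot hang shutdown forever.
        thread.join(timeout=2)
        # Kept from the reviewed snippet; harmless if nothing is raised.
        with suppress(ValueError):
            stop_q.close()
        # Wait for the queue's background feeder thread to flush and exit.
        stop_q.join_thread()
    return thread.is_alive(), seen


still_alive, seen = run_with_cleanup()
```

Each step is usually redundant, as noted above, but running all of them in `finally` means the watcher and the queue's feeder thread are both accounted for on every exit path.
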

n = await Nanny(s.address, nthreads=2, loop=s.loop)
while len(s.workers) < 3:
    await asyncio.sleep(0.1)
async with Nanny(s.address, nthreads=2) as n:
Member

A nit for my own comprehension -- can we always assume that s.loop is IOLoop.current()?

Member Author

Inside any async context, yes.
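
As a sketch of why the `async with` form helps and why no explicit `loop=` argument is needed, the example below uses plain asyncio rather than tornado or distributed. `Service` is a hypothetical stand-in for a server-like object; its attributes and methods are assumptions made for this illustration and do not reflect the real `Nanny` API.

```python
import asyncio


class Service:
    """Hypothetical stand-in for a server-like object such as a Nanny."""

    def __init__(self):
        self.status = "init"
        self.loop = None

    async def start(self):
        # Inside a coroutine the running loop can be looked up directly,
        # so it does not have to be passed in by the caller.
        self.loop = asyncio.get_running_loop()
        self.status = "running"
        return self

    async def close(self):
        self.status = "closed"

    async def __aenter__(self):
        return await self.start()

    async def __aexit__(self, exc_type, exc, tb):
        # Runs on every exit path, including when the body raises.
        await self.close()


async def main():
    svc = Service()
    try:
        async with svc:
            same_loop = svc.loop is asyncio.get_running_loop()
            raise RuntimeError("simulated test failure")
    except RuntimeError:
        pass
    return same_loop, svc.status


same_loop, final_status = asyncio.run(main())
```

Even though the body raises, the object ends up closed, which is the property a bare `n = await ...` without a matching close does not give.
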

assert not a.has_what.get(n_worker_address)
assert not any(
    n_worker_address in s for ts in a.tasks.values() for s in ts.who_has
)
Member

A local VS Code diff indicates that these changes are merely indentation changes (GitHub doesn't do such a good job here).

Member Author

Yup.

@mrocklin mrocklin merged commit a5cc5e0 into dask:main Apr 18, 2022
@mrocklin mrocklin deleted the nanny-cleanup branch April 18, 2022 23:25
gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request May 13, 2022
Introduced in dask#6146. We don't know why this code path is only being triggered now.
gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request May 17, 2022
Introduced in dask#6146. We don't know why this code path is only being triggered now.
@gjoseph92 gjoseph92 mentioned this pull request May 17, 2022
2 tasks