[Discussion] Structured concurrency #6201

This issue is meant for higher-level discussion around dropping Tornado (https://github.com/dask/distributed/issues/6047) and our async infrastructure in general.

I count 42 instances of `loop.add_callback` or `loop.call_later` in distributed. That's equivalent to saying we have 42 `goto` statements in distributed.

Unfamiliar with this comparison? Please read https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful. (This whole issue is really just saying "please read this". If you've read it, read it again. It took me a few reads for the ideas to really sink in.)

The basic idea is that when you use `add_callback`, or `asyncio.create_task` (or `go` in golang, etc.), you're starting something to run concurrently, but not dealing with the endgame:
- what if the callback raises an error?
- what if an error occurs somewhere else and you now need to cancel the callback?
- how do you even know when the callback is done?
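
To make those three questions concrete, here's a minimal sketch in plain asyncio. The names `flaky_callback` and `fire_and_forget` are made up for illustration and aren't anything in distributed. The caller starts a task, carries on, and returns normally even though the task has already failed:

```python
import asyncio


async def flaky_callback():
    # Stands in for any callback started with add_callback / create_task.
    raise RuntimeError("callback failed")


async def fire_and_forget():
    # Start the task concurrently and never wait on it.
    task = asyncio.create_task(flaky_callback())
    await asyncio.sleep(0.01)  # the caller carries on with other work
    # By now the task has failed, but nothing has told the caller.
    return task


async def main():
    # fire_and_forget() returns normally; no exception reaches us here.
    task = await fire_and_forget()
    # The failure is only visible if somebody remembers to go and look.
    return task.done(), type(task.exception()).__name__


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The error only shows up because `main` happens to hold a reference to the task and inspects it afterwards. If nobody does that, the most you'd typically get from asyncio is a log message when the task object is garbage-collected.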

This _can_ be fine. But because you can just fire off callbacks willy-nilly and not deal with their consequences or lifetimes, it's very hard to reason about and maintain code that does this. (Code that uses `goto` statements is the same: it can work, but it's hard to reason about and maintain.) When things go as expected, they probably work, but when anything unexpected happens, it's very easy to go off the rails. Of course, a distributed system is just the kind of system where unexpected things are nearly guaranteed to happen, and it's unacceptable to not handle them correctly when they do.

Here are some recent issues that I think, at their core, stem from using unstructured concurrency. That is, with structured concurrency, we'd either be unlikely (or unable) to design something that could get into this sort of broken state, or an unhandled error would propagate up and shut down the whole worker, instead of being ignored while the worker hobbles along doing the wrong thing as though everything's fine:
- https://github.com/dask/distributed/issues/5958 (exception not propagated in `execute`)
- https://github.com/dask/distributed/issues/6159 (can't cancel `gather_dep` coroutine)
- https://github.com/dask/distributed/issues/6142 (can't propagate exception from `execute`)
- https://github.com/dask/distributed/issues/5481 (coroutine assumed running when it's failed; exception not propagated)
- https://github.com/dask/distributed/issues/6194 (exception not propagated)
- https://github.com/dask/distributed/issues/5366 (exception not propagated?)
- https://github.com/dask/distributed/issues/6107 (contextvars mishandled in async framework)
- https://github.com/dask/distributed/issues/5955 ([handling shutdown mess](https://vorpus.org/blog/control-c-handling-in-python-and-trio/))
- https://github.com/dask/distributed/issues/6087 ([handling shutdown mess](https://vorpus.org/blog/control-c-handling-in-python-and-trio/))
- I'll also include the recent things involving `gather_dep`, since I think not being able to cancel or respond to errors with in-progress `gather_dep` callbacks is the core reason it's so brittle (https://github.com/dask/distributed/issues/6159#issuecomment-1104141588). And using structured concurrency, it would be awkward/non-idiomatic to implement something like our current design.
  - https://github.com/dask/distributed/issues/6125
  - https://github.com/dask/distributed/issues/5951
  - https://github.com/dask/distributed/issues/6191

-----

I'm not saying "we need to rewrite all of distributed to use trio right now!" I recognize this isn't going to happen immediately.

But I think we should at least take the concepts of structured concurrency to heart. [TaskGroups are getting added to asyncio in py3.11](https://twitter.com/1st1/status/1493748843430567942), and are [available now in aiotools](https://aiotools.readthedocs.io/en/latest/aiotools.taskgroup.html) and [anyio](https://anyio.readthedocs.io/en/stable/tasks.html). When using concurrency, we should always:
- Have a way for the coroutine to propagate errors upwards
- Have a way to cancel the coroutine
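
As a rough sketch of what those two properties look like when done by hand in plain asyncio (the function names here are hypothetical, and this doesn't try to cover the nastier edge cases): the parent waits on its children, re-raises the first failure, and cancels whatever is still running before it returns.

```python
import asyncio


async def failing_child():
    await asyncio.sleep(0.01)
    raise ValueError("child failed")


async def long_running_child(state):
    try:
        await asyncio.sleep(60)
    except asyncio.CancelledError:
        state["cancelled"] = True  # cleanup would go here
        raise


async def run_children(state):
    tasks = [
        asyncio.create_task(failing_child()),
        asyncio.create_task(long_running_child(state)),
    ]
    try:
        done, _pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_EXCEPTION
        )
        for task in done:
            task.result()  # re-raises the child's exception in the parent
    finally:
        # Cancel anything still running, then wait for it to actually stop,
        # so no child outlives this function.
        for task in tasks:
            task.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)


async def main():
    state = {"cancelled": False}
    try:
        await run_children(state)
    except ValueError as exc:
        return str(exc), state["cancelled"]
    return None, state["cancelled"]


if __name__ == "__main__":
    print(asyncio.run(main()))
```

This is roughly the bookkeeping a task group or nursery does for you, which is why having it as a primitive matters: every call site that spawns tasks would otherwise need to repeat this pattern and get it right.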

This is kinda possible to do just with plain asyncio... it's just that to do it easily and consistently and _reliably_ (in the face of signal handlers, `__del__` methods, exceptions in `finally` blocks, etc.) you'd start wishing you were using trio.

Also, switching to using trio/anyio may not be as dramatic as you'd think. Trio can be hard to adopt because it doesn't support asyncio-based libraries (though anyio can help with this). But good news—we're not using any asyncio-based libraries (besides Tornado)!

And we might not have to immediately rewrite every `loop.add_callback` to do structured concurrency "properly" (an `async with trio.open_nursery()` block, etc.). Trio's [clever "escape hatch"](https://vorpus.org/blog/notes-on-structured-concurrency-or-go-statement-considered-harmful/#there-is-an-escape) means we could just store a global `Nursery` object on the `Worker`/`Scheduler`/`Client` instance, and use it a lot like we use the current event loop. Replace `self.loop.add_callback` with `self.nursery.start_soon` and we're in a strictly better state than we were before, because now previously-unhandled exceptions (aka fatal errors) in callbacks will propagate and cleanly shut down the whole worker/scheduler/client. Dramatic, but _correct_, and better than deadlocking. And then we can gradually give these places better structure once the tools are available.
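
Here's a rough sketch of that escape-hatch pattern. To keep it standard-library only, it uses `asyncio.TaskGroup` (Python 3.11+) as a stand-in for a trio nursery, with `create_task` playing the role of `start_soon`. `MiniWorker` and its methods are hypothetical and are not distributed's API.

```python
import asyncio


class MiniWorker:
    """Hypothetical stand-in for Worker/Scheduler/Client."""

    def __init__(self):
        self.tasks = None  # long-lived task group, set in run()
        self.heartbeats = 0

    async def heartbeat(self):
        # A background callback that would otherwise run forever.
        while True:
            self.heartbeats += 1
            await asyncio.sleep(0.001)

    async def broken_callback(self):
        await asyncio.sleep(0.01)
        raise RuntimeError("fatal error in callback")

    async def run(self):
        # Requires Python 3.11+ for asyncio.TaskGroup.
        async with asyncio.TaskGroup() as tg:
            self.tasks = tg
            # Previously something like: self.loop.add_callback(self.heartbeat)
            self.tasks.create_task(self.heartbeat())
            self.tasks.create_task(self.broken_callback())
        # Leaving the block waits for every task started on the group.


async def main():
    worker = MiniWorker()
    try:
        await worker.run()
    except Exception as group:  # an ExceptionGroup wrapping the RuntimeError
        return [str(exc) for exc in group.exceptions], worker.heartbeats


if __name__ == "__main__":
    print(asyncio.run(main()))
```

When `broken_callback` fails, the group cancels `heartbeat` and the error comes out of `run()`, so the whole worker stops instead of carrying on with a dead callback. Any method that can reach `self.tasks` can start more work on it while `run()` is active, which is what makes it usable as a near drop-in for the event loop.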

But, that said... structured concurrency is so fundamentally different from the traditional horrid mess of callbacks that there's debate on anyio around whether to even write a tutorial on how to port asyncio libraries: https://github.com/agronholm/anyio/issues/245#issuecomment-809976258. It's considered unlikely that trio's advances could be [brought back](https://vorpus.org/blog/timeouts-and-cancellation-for-humans/#asyncio) [to asyncio](https://vorpus.org/blog/some-thoughts-on-asynchronous-api-design-in-a-post-asyncawait-world/#should-asyncio-be-fixed-to-have-a-curio-style-async-await-native-api). It's probably reasonable to expect that adopting structured concurrency would, eventually, mean re-architecting many of our existing systems. But if we can do that incrementally (which we probably can with anyio), I think we'd end up in a much more reliable and maintainable state than is possible with the current unstructured callbacks.
