Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Promote {defer,allow}_cancellation as first class constructs #149

Closed
imrn opened this issue Dec 27, 2016 · 39 comments
Closed

Promote {defer,allow}_cancellation as first class constructs #149

imrn opened this issue Dec 27, 2016 · 39 comments

Comments

@imrn
Copy link

imrn commented Dec 27, 2016

Currently they are documented as a single line
in "Low-level Kernel System Calls" section.
I think they can be documented in "Tasks"
section with some examples.

They have direct everyday usage to protect a block
of code from cancellation in the middle or having
cancellation occur at a favorable point.

One example would be resetting a http2 stream
only at frame boundries.

@dabeaz
Copy link
Owner

dabeaz commented Dec 27, 2016

Yes, this needs better documentation. On it.

@dabeaz
Copy link
Owner

dabeaz commented Dec 29, 2016

A more general question. Would it make sense for defer_cancellation() to behave more like the timeout_after() function? For example:

result = await defer_cancellation(coro(args))

Or as a context manager::

async with defer_cancellation():
       statements
       ...

Just thinking out loud...

@imrn
Copy link
Author

imrn commented Dec 29, 2016 via email

@njsmith
Copy link
Contributor

njsmith commented Dec 29, 2016 via email

@dabeaz
Copy link
Owner

dabeaz commented Dec 29, 2016

The nesting of (defer|allow)_cancellation seems particularly tricky to me. Consider this example::

async def coro():
        async with defer_cancellation:
                statements
                ...
                async with allow_cancellation:
                        await some_blocking()

async def sometask():
        async with defer_cancellation:
                await coro()

Does the allow_cancellation block actually allow the delivery of a cancellation or not to the some_blocking() operation? Especially in light of the sometask() coroutine explicitly not allowing cancellation at the outer level.

As currently implemented, it seems that the above code could be cancelled because the internal check for cancellation only looks at the top-most level of nesting. I'm now wondering if it should consider all of the levels. That is, cancellation is only allowed if ALL of the levels allow it.

Thoughts?

@dabeaz
Copy link
Owner

dabeaz commented Dec 29, 2016

Also, can the allow_cancellation feature be used outside of a corresponding defer_cancellation block?

@dabeaz
Copy link
Owner

dabeaz commented Dec 29, 2016

Bigger question: Do we even need allow_cancellation? Is it there purely for completeness or is there a use-case? The existence of both "allow" and "defer" complicates things considerably in my opinion. On the other other hand, if only "defer" was provided, that is much easier to reason about. Just trying to wrap my brain around this whole thing so that I can adequately document it (and the associated semantics).

@imrn
Copy link
Author

imrn commented Dec 29, 2016

Wouldn't be logical that inner most *_cancellation wins?
That is:

async def cancel_me():
    async with defer_cancellation:
        # Here defered   <-->
        async with allow_cancellation:
            # Here allowed   <---> Cancellation can only happen here
            async with defer_cancellation:
                # Here defered  <-->

task = await spawn(cancel_me())
await task.cancel()

Same thing applies to call stack.
Is it difficult to implement?

@imrn
Copy link
Author

imrn commented Dec 29, 2016

Each context enter/exit can update "cancellable" status of
the task. Pending cancellations can be processed when allowed.
Is it simpler in such words than actual code?

@imrn
Copy link
Author

imrn commented Dec 29, 2016

A single "cancellable" flag will not handle below example.
It can not be an alternative to current stack structure..
I take it back. :(

async def cancel_me():
    async with defer_cancellation:
        # Here defered   <-->
        async with defer_cancellation:
            # Here defered

task = await spawn(cancel_me())
await task.cancel()
  • This is not normal code, however equivalent combination
    can occur on a call stack.

@dabeaz
Copy link
Owner

dabeaz commented Dec 29, 2016

Right now, the behavior is implemented such that the inner-most block determines what happens. However, I don't like it. I think it makes any use of allow_cancellation a ticking time-bomb in the code-especially if it's buried in a library function. See my earlier comment for an example of such code.

Big picture: I don't think a singe inner-mostallow_cancellation should enable cancellation when nested defer_cancellation operations have been applied. If some outer defer_cancellation is in effect, it was done for a reason--it shouldn't be something that can be arbitrarily overridden by accident within the innards of a library function that happens to use allow_cancellation someplace.

@imrn
Copy link
Author

imrn commented Dec 29, 2016

Yeah I'm eagerly flip-flopping: :)
How about restoring task's "PREVIOUS cancellable state"
on each *_cancellation context exit. Can we
eliminate stack structure this way?
Any counter proof/example for it?

@imrn
Copy link
Author

imrn commented Dec 29, 2016 via email

@dabeaz
Copy link
Owner

dabeaz commented Dec 29, 2016

I'm sorry, but this makes no sense whatsoever. If I write code like this:

async def spam():
        async with defer_cancellation:
                 await coro1()
                 await coro2()
                 await coro3()

My expectation is that this code is guaranteed to not receive a cancellation in that block of statements. No matter what coro1(), coro2(), or coro3() are doing internally. End of story.

As Curio is currently implemented, ANY use of allow_cancellation inside coro1(), coro2() or coro3() violates this guarantee. The reason is that a cancellation request is supposed to kill the entire task regardless of the call stack. A cancellation request also can't be ignored. So, even if one of those coroutines "caught" the cancellation request, it is not allowed to just eat it and keep on going. That's not cancellation.

The bottom line is that any use of allow_cancellation has the potential to make defer_cancellation fail to work as advertised--making it rather worthless as a feature in my opinion.

The whole situation is further complicated by the fact that timeouts are intermingled with cancellation as a concept. Unlike a true cancellation, a timeout can be caught and acted upon. So, that's it's own problem on top of everything else.

So, with all of this, I'm not exactly sure how I'd want to proceed. I just know that I don't want to be fielding a bunch of bug reports regarding tasks being mysteriously cancelled despite the fact that they're in the middle of a defer_cancellation block at the time.

@dabeaz
Copy link
Owner

dabeaz commented Dec 29, 2016

My gut feeling on all of this is to remove allow_cancellation altogether and tell people that they can delay the delivery of a cancellation with defer_cancellation. This should be accompanied by some kind of warning about its purpose and how it is probably best used on a series of fine-grained operations that shouldn't be interrupted as opposed to some kind of setting that you'd just turn on for huge portions of task execution.

@imrn
Copy link
Author

imrn commented Dec 30, 2016

Your assumptions are consistent only if "allow_cancellation is non-existent".
If it exists (I think it should) you'd better consider my proposal.
If allow_cancellation is unused my proposal is equivalent to yours.

Lets think "allow_cancellation" as an advanced option that can open
cancellation window down the call chain. With only defer_cancellation
at hand it is impossible to have something like that down the call chain.
Even with fine-grained defers...

@imrn
Copy link
Author

imrn commented Dec 30, 2016

The reason is that a cancellation request is supposed to kill the
entire task regardless of the call stack.
So, even if one of those coroutines "caught" the cancellation request,
it is not allowed to just eat it and keep on going. That's not cancellation.

This is agreeable. But what's the current status if I catch and try to continue on
an ordinary cancellation in a normal task? (the one without any defers/allows)

Can I disregard a cancellation in a task? I theory I guess I can.
I can devise my try:catch: block so that I never give back control.
Right? I did't try it till now. I'll do it but requires some setup..

@njsmith
Copy link
Contributor

njsmith commented Dec 30, 2016

My gut feeling on all of this is to remove allow_cancellation altogether and tell people that they can delay the delivery of a cancellation with defer_cancellation

So history: I started out implementing just defer_cancellation like you said, then when I tried using it I found it really difficult to do without allow_cancellation. The use case is for something like a loop that does various things, and wants to isolate the cancellation handling to a small part. I'm not sure that it's actually possible to convert this into using just defer_cancellation only. (The problem is that you can't just split up a big defer_cancellation into multiple small defer_cancellations, because exiting a defer_cancellation block creates a cancellation point.)

    async def _supervise_loop(self):
        assert self._started
        chained_exc = None
        async with curio.defer_cancellation:
            while self._tasks or not self._shutting_down:
                try:
                    # This is the only line where we want to allow
                    # cancellation / timeouts:
                    async with curio.allow_cancellation:
                        task = await self._q.get()
                    await self._process_finished_task(task)
                except BaseException as exc:
                    if not self._shutting_down:
                        await self.start_shutdown()
                    if chained_exc is not None:
                        exc.__context__ = chained_exc
                    chained_exc = exc
        # All done!
        await self._finished.set()
        if chained_exc is not None:
            raise chained_exc

You're absolutely right that the current semantics are weird. Basically my rationale was just, AFAICT all possible semantics are kinda weird, it only makes sense to use these in tight coordination with each other, so I just went with the simplest weird semantics. It would be nice if there were something better, just not sure what that is.

I'm also very strucky by your point about defer_cancellation just flat-out breaking any subroutines that might use timeouts internally. That's another terrible compositionality break. Intuitively, I guess the concept we're groping towards here is that we want to be able to manage external cancellation by deferring it to some controlled point, but this shouldn't affect internal cancellation (like a timeout_after that's inside the defer_cancellation).

I have some not-quite-formed thoughts about how a more-reified notion of cancellation like this one might help here, but I'll stop here for now rather than try to write a dissertation at half-past-midnight or delay the rest of this post indefinitely :-)

@imrn
Copy link
Author

imrn commented Dec 30, 2016

Here is an unkillable task! Any sufficiently determined one
can eat up the cancellation. So don't put the blame on
poor allow_cancellation. Probably there is nothing
can be done to protect againts it. Don't worry be happy.

And it may be better to review task management code
against the assumption of all well behaving tasks..
I guess best can be done is to assume cancellation
just as an advisory thing. No guarantees.. Sorry.

from curio import spawn, sleep, run

async def cancel_if_you_can():
    while True:
        try:
            print('Alive!')
            await sleep(1)
        except:
            pass

async def main():
    inmortal = await spawn(cancel_if_you_can())
    while True:
        sleep(1)
        await inmortal.cancel()

run(main())

@dabeaz
Copy link
Owner

dabeaz commented Dec 30, 2016

Yes, you can make a task unkillable. However, if you do, a lot of other stuff is going to break. For instance, run() won't ever return and the cancel() method itself will deadlock because it waits for the task to terminate. So, it's going to show up as a programming error elsewhere. I'm pretty sure that the documentation already states that cancellation exceptions can be caught (and actions taken), but that it's not legal to simply ignore a cancellation request and keep on going.

@dabeaz
Copy link
Owner

dabeaz commented Dec 30, 2016

I've been doing a lot of thinking about this feature and simply don't see how I can make it work as formulated. I believe it is inherently ill-defined as a context-manager. Specifically, if I see code like this:

async def spam():
        async with defer_cancellation:
                await coro()
                print('Done')

I fully expect that the "Done" message will appear no matter what happens with respect to cancellation. I put it in the same category as code that might appear in a finally: block. That code is supposed to run without cancellation. A strong guarantee.

There are all sorts of problems with it though:

  1. If coro() uses allow_cancellation, it's broken. Task can be cancelled.
  2. If coro() itself uses defer_cancellation, it's broken. Task can be cancelled on exit from the inner deferred block.
  3. If coro() gets a cancellation request internally and handles it, where does the task go from there? It can't ignore the request. It can't allow the task to keep running. coro() has to report a failure somehow. How does it do this exactly? How does the cancellation proceed? It's broken.
  4. If coro() relies upon timeouts, it's broken. Timeouts can't be delivered if a defer_cancellation is in effect.

I just don't see any way out of this entanglement. I think you could fix it if you reformulated defer_cancellation so that it launched code to be shielded from cancellation into its own task. Maybe something like this:

async def defer_cancellation(coro):
       task = await spawn(coro)
       try:
             return await task.join()
       except CancelledError:
             await task.join()      # Wait for operation to finish
             raise

This is a lot simpler. It also allows the shielded coroutine to use timeouts and other features. Yes, it requires you to put the uninterruptible code into its own coroutine--so you'd lose the context-manager aspect of it. However, I think the restored sanity makes up for it.

@dabeaz
Copy link
Owner

dabeaz commented Dec 30, 2016

Another possibility is to disallow nesting entirely. Curio could be made to raise a runtime error if you did it.

@imrn
Copy link
Author

imrn commented Jan 2, 2017

As a documentation note, lets warn users that
"Non-Specific Exception Handling In a Call Tree Where Cancellations/Timeouts Occur"
means asking for untrackable problems. An innocent looking try/catch block
at an unnoticable corner of a file such as

try:
    await timeout_after(10, co)
except:
    pass

will eat up your cancellations/timeouts...

Each try catch block in such a tree "with some awaits in it",
should be ready to handle exception coming from outside AND
re-raise them.

This is something we dont have to think for sync code.

@njsmith
Copy link
Contributor

njsmith commented Jan 2, 2017

I think you could fix it if you reformulated defer_cancellation so that it launched code to be shielded from cancellation into its own task.

I don't see how to safely write my task supervisor loop using this... am I missing something?

Another possibility is to disallow nesting entirely. Curio could be made to raise a runtime error if you did it.

This would be OK, I guess, treating these as a pretty low-level finicky constructs that most people shouldn't interact with. But it would still mean that "whether this function uses a timeout internally" would become an external part of its api (because with defer_cancellation: await some_func() would either raise a nesting error or not), which breaks encapsulation in an unfortunate way. It would be nice to do better.

Here's a sketch of an alternative way, based on that cancellation token idea I linked up-thread.

First, we reify the idea of a "cancellation token" as an first-class object. I'm not sure how much public API we want, but conceptually it's something like a curio.Event that can be set exactly once, and when it's set it also has a payload which is an exception object that inherits from curio.CancelledError.

Each task has associated with it an "active set" of cancellation tokens. If one of these objects becomes "set", then the associated exception is raised at the next cancellation point. (I guess if multiple are set, then we can raise a chained exception?)

When we create a new task, we also create an associated cancellation token, which we stash somewhere in the task object. Initially, the task's "active set" consists of this token alone. Task.cancel() sets this token:

class Task:
    def __init__(self, ...):
        self.cancel_token = CancelToken()
        self.cancel_set = {self.cancel_token}
    def cancel(self):
        self.cancel_token.set(curio.TaskCancelledError())
    ...

There are some primitives that let us manipulate the cancel set, in particular adding and removing tokens. (When a token that is already 'set' is added, it immediately raises an error.)

defer_cancellation works by popping off everything in the current cancel set, running some code, and then putting it back. It returns the values it popped, or at least some sort of handle to them (could be opaque, I dunno), so the defer_cancellation/allow_cancellation pattern becomes something like:

async with defer_cancellation as cancel_handle:
    ...
    async with allow_cancellation(cancel_handle):
        ...
    ...

timeout_after and friends are modeled conceptually as creating a new cancel token which will be set when the given timeout expires and pushing them into the cancel set.

Interesting example number 1:

async with timeout_after(10):
    ...
    async with defer_cancellation:
        # code in here is protected from Task.cancel() and from the timeout_after(10)
        ...
        async with timeout_after(5):
            # code in here is protected from those, but is still subject to the timeout_after(5)
            ...
    # once we leave the defer_cancellation block, the timeout_after(10) is back in force

Interesting example number 2: create a global cancellation token representing "the server is being shut down". Whenever we spawn a new task, use some kwarg to spawn to add this token to the task's cancel set. Then when we want to cancel all tasks and shut down, we just set this one token.

@njsmith
Copy link
Contributor

njsmith commented Jan 2, 2017

timeout_after and friends could return the associated cancel token, and their exceptions could carry it, allowing using to do things like check whether an exception was due to a particular timeout, and also for code like

async with ignore_after(10) as token:
    ...
if token.is_set():
    # the timeout fired
    ...

(This is related to something I've been meaning to propose for a bit, actually. There are basically three things people might want to do after a timeout has fired: (a) ignore, (b) catch and handle, (c) throw an exception like "I give up (btw it was because of a timeout)". Right now we use ignore_after for ignore and timeout_after for catch-and-handle and throw-an-exception, but this ends up being kinda suboptimal because catch-and-handle requires complex syntax with multiple nesting, and the mixture of the two of them creates the mess where we end up needing three exception types to avoid mixing up these two cases. If we switch to using the pattern above, so that there's one context manager used for the ignore and catch-and-handle cases, and a different one used for throw-an-exception, then these both go away.)

@dabeaz
Copy link
Owner

dabeaz commented Jan 2, 2017

I've been doing a lot of thinking about this whole subject over the last few days and I'm afraid that {allow|defer}_cancellation is going on the chopping block. The semantics are too complicated and the current implementation is logically unsound in the presence of nesting. I don't know if adding tokens saves it or not, but I do know that "if the implementation is hard to explain, it's a bad idea" is part of the Zen of Python ;-). I also know that my head is starting to explode.

I think a better solution to this whole problem is going to be cancellation polling. As an option, it's easy to make it so that tasks NEVER receive a CancelledError exception. Instead, a flag can be set that must be checked via explicit polling::

async def supervisor(self):
        myself = await current_task()
        while self._tasks and not self._shutting_down:
                whatever
                ...
                if myself.cancelled:            # Have I been cancelled? 
                      await self._start_shutdown()

When launching the supervisor, you'd supply some kind of option to spawn() to not deliver a CancelledError exception:

  task = await spawn(self.supervisor(), allow_cancellation=False)

Timeouts, although related to cancellation, are a different kind of animal. Timeouts can be handled and ignored. The semantics of nesting them is more well-defined. I'm not inclined to change how they currently work. If the above supervisor is concerned about getting a timeout from outside, then don't apply a timeout to it. Or wrap it:

  task = await spawn(self.supervisor(), allow_cancellation=False)
  try:
         await task.join()
  except TaskTimeout:
         await task.cancel()

@dabeaz
Copy link
Owner

dabeaz commented Jan 2, 2017

An explicit flag (for polling) could also be used to control things more finely.

 async def supervisor(self):
         myself = await current_task()
         myself.allow_cancellation = False
         while self._tasks and not self._shutting_down:
                  ...
                  try:
                       myself.allow_cancellation = True
                       if myself.cancelled:
                           raise CancelledError()
                       else:
                           task = await q.get()
                       myself.allow_cancellation = False
                  except CancelledError:
                       ...

I suppose that could be wrapped in some kind of context manager to bring it back to something similar to what's there now. Main difference is that the implementation would need to explicitly check the cancelled attribute to see if cancellation was requested prior to turning on cancellation.

As it stands now though, the implementation is far too complicated. Dedicated kernel traps, internal stacks, and numerous support functions. It doesn't need to be this complicated..

@njsmith
Copy link
Contributor

njsmith commented Jan 2, 2017 via email

@dabeaz
Copy link
Owner

dabeaz commented Jan 2, 2017

Use a timeout! Partly joking... maybe.

I am current sorting through a reworking of the machinery involved with this. Right now, the implementation of {defer|allow}_cancellation is too complicated and the semantics are too weird to explain to mortals.

I think there is an alternative way to approach this that will allow you to do what's needed and which can be explained more sanely. Working on it now.

@njsmith
Copy link
Contributor

njsmith commented Jan 2, 2017

Oh hmm, somehow I didn't see this comment until just now.

Sure, one could replace the kernel traps with direct manipulation of the task struct, and the stack could be replaced with a single scalar. Most of the implementation complexity IMO comes from how timeouts works in the kernel (not sure this is avoidable), and most of the semantic complexity comes from the issues discussed in this thread around nesting and interaction with timeouts. But I'll be interested to see what you come up with :-)

@dabeaz
Copy link
Owner

dabeaz commented Jan 3, 2017

This whole issue is quite hard and interesting. I love thinking about this stuff ;-).

@dabeaz
Copy link
Owner

dabeaz commented Jan 3, 2017

I have implemented a major reworking on cancellation control and related features. Rather than describe it here, please read this:

http://curio.readthedocs.io/en/latest/devel.html#cancellation-control

This change will break all existing use of {defer|allow}_cancellation (sorry). There are comparable features with {disable|enable}_cancellation() functions though. It should not break other Curio code so far as I know.

Basically gist is this: cancellation can still be controlled, but there are some refined rules about when things happen, handling of exceptions, and nesting.

@imrn
Copy link
Author

imrn commented Jan 3, 2017

It really seems fine. Thanks.
It says nested disable_cancellation is allowed.
Does it allow any combination of nesting with BOTH disable_cancellation
AND enable_cancellation?
That is:
disable -> enable -> enable -> disable -> enable -> etc etc..

@dabeaz
Copy link
Owner

dabeaz commented Jan 3, 2017

Right now, it does not allow any combination of nesting. So something like disable->enable->enable is illegal and results in a RuntimeError. disable->enable->disable->enable is okay though.

Here is my general thought on this... cancellation control is a pretty advanced thing to be introducing into your code. It makes everything more difficult to reason about and requires a LOT of attention to fine details. Because of that, its use really requires a fair amount of forethought. My general fear is that people who don't quite understand what they're doing would start using these features haphazardly. For example, writing a bunch of library functions like this:

async def coro():
        async with enable_cancellation():
               ...

IMO, just having a bare enable_cancellation() sitting there at the top of functions like that would have the potential to cause all kinds of insanity elsewhere. By disallowing the use of enable_cancellation() all by itself, I think it will "force" people to use it in closer proximity to disable_cancellation()--most likely in the same function. My hope is that this would produce greater clarity about what is actually happening with respect to cancellation control when these features are being used.

From an implementation standpoint, the restriction is somewhat artificial. There's nothing about the implementation that nesting breaks. The risk of breakage is more with user-written code that happens to use these features.

@dabeaz
Copy link
Owner

dabeaz commented Jan 3, 2017

One thing I'm not entirely sure about is the actual delivery of cancellation exceptions. In the current implementation, it only happens on the next blocking operation after a disable_cancellation() context. It could potentially happen on exit from the disable_cancellation() context manager. I'd need to think about that. On the whole, the idea of raising exceptions in the __exit__() method of a context manager gives me an unsettled feeling though. For example, if you read PEP 343 (https://www.python.org/dev/peps/pep-0343/ ), you'll see comments like this:

However, if __exit__() propagates an exception to its caller, this
means that __exit__() *itself* has failed.  Thus, __exit__()
methods should avoid raising errors unless they have actually 
failed.  (And allowing the original error to proceed isn't a 
failure.)

Raising a cancellation exception in the __exit__() method just feels sort of dicey. I don't know.

@imrn
Copy link
Author

imrn commented Jan 3, 2017 via email

@dabeaz
Copy link
Owner

dabeaz commented Jan 3, 2017

I think I'm going to play it "safe" to start with this. This whole cancellation handling issue is fascinating and hairy at the same time. Putting some constraints on it at the start might be a prudent move ;-).

@dabeaz
Copy link
Owner

dabeaz commented Mar 3, 2017

Closing for now. enable/disable cancellation are implemented and available. Revisit in a new issue of further ideas/improvements arise.

@vinipsmaker
Copy link

One thing I'm not entirely sure about is the actual delivery of cancellation exceptions. In the current implementation, it only happens on the next blocking operation after a disable_cancellation() context.

That's the correct answer.

Imagine implementing a mutex. mutex.lock() must not raise TaskCancelled, so the body of the function is wrapped in disable-cancellation blocks. Once the block ends, the pending cancellation request must not be raised — it must stay pending until next suspension/cancellation point. That'll be the case for any function that must not raise TaskCancelled where the only usable behaviour that curio can offer is to defer cancellation until next suspension/cancellation point.

About the whole enable_cancellation thing, you could mirror Boost.Thread design (but just this small part) and offer a restore_interruption — not a enable_interruption. restore_interruption restore cancelability state to whatever it was before a disable_interruption block that the function controls. For instance...

async def _supervise_loop(self):
    assert self._started
    chained_exc = None
    b1 = disable_cancellation()
    async with b1:
        while self._tasks or not self._shutting_down:
            try:
                # This is the only line where we want to allow
                # cancellation / timeouts:
                async with restore_cancellation(b1):
                    task = await self._q.get()
                await self._process_finished_task(task)
            except BaseException as exc:
                if not self._shutting_down:
                    await self.start_shutdown()
                if chained_exc is not None:
                    exc.__context__ = chained_exc
                chained_exc = exc
    # All done!
    await self._finished.set()
    if chained_exc is not None:
        raise chained_exc

Also, task.cancel() should not act as a cancellation point when blocking=False. It makes a lot easier/safer to express certain concurrency patterns.

I also happen to like this topic. I might necrobump a few more threads while I follow the past discussions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants