
Support async cancellations. #642

Closed
Zac-HD opened this issue Dec 22, 2022 · 8 comments · Fixed by #726
Labels
bug Something isn't working

Comments

@Zac-HD

Zac-HD commented Dec 22, 2022

Over at @anthropics we're keen users of httpx on Trio, and I've noticed that we sometimes have issues where cancelling a task doesn't get all the teardown logic right. For example, if you try to await (or async for/with) in an except BaseException: or finally: block, Trio will immediately re-raise a Cancelled exception instead of running your teardown logic.

This can be easy to miss, since it happens as you're abandoning the task anyway, but it can cause a variety of subtle or not-so-subtle problems; those problems motivated us to build flake8-trio, which includes the TRIO102 check for exactly this case. I'm pretty sure that httpcore has multiple such issues, e.g.

except BaseException as exc:
    await self.response_closed(status)
    raise exc

doesn't look cancel-safe to me. Do you consider "this doesn't close the connection properly when cancelled" a bug? If so, please consider this a fairly general bug-report!

(I've also found it infeasible to consistently get this right without tool support. If you want to try flake8-trio many of the checks are framework-agnostic; and if it'd be useful for httpcore and httpx we'd be happy to add anyio support to the linter 🙂)

@tomchristie
Member

Hi Zac,

Good to hear from you.

Do you consider "this doesn't close the connection properly when cancelled" a bug?

It sounds like one, yup.

I'd expect there to be plenty of subtleties around this...

How can we start narrowing this down? What broken state can we end up in? Are we able to replicate this in a test case at all?

@Zac-HD
Author

Zac-HD commented Jan 24, 2023

(sorry, this dropped into the holiday void; followups should be faster)

Pulling out https://github.com/encode/httpcore/blob/f0657cb43cb707d1672b93b61bb53b4cfb166820/httpcore/_async/connection_pool.py#LL226-L234C30 as a self-contained example:

            try:
                connection = await status.wait_for_connection(timeout=timeout)
            except BaseException as exc:
                # If we timeout here, or if the task is cancelled, then make
                # sure to remove the request from the queue before bubbling
                # up the exception.
                async with self._pool_lock:
                    self._requests.remove(status)
                    raise exc

Suppose that we're running this task inside a Trio nursery or anyio task group, which gets cancelled while we await status.wait_for_connection(). Then we'll be in a cancelled state during the except: block, which means that the attempt to acquire the lock will fail (by raising Cancelled), because in this state Trio only allows a checkpoint (i.e. an await, async for, or the checkpoint on enter or exit of an async with) if you've opened a shielded cancel scope.

The impact in this case is that we leak status into self._requests, which probably isn't a huge problem. I'm confident that you could reproduce this with a careful test that started waiting for a connection, slept part of the timeout, and then cancelled the nursery/taskgroup. The self.response_closed(status)-not-running case in my top post is the same principle, just for a larger chunk of code.

This is easily solved by wrapping with anyio.CancelScope(shield=True): around the async with self._pool_lock: block, and though I expect more serious problems are latent if we go looking, they should also be pretty easy to fix. The real problem is consistently remembering all these fiddly invariants, which is why I've been working on flake8-trio to flag them for us.
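For concreteness, a minimal sketch of that shielded-scope fix applied to the pool snippet above (assuming anyio is imported in this module; this is a sketch of the suggestion, not the patch that eventually landed):

try:
    connection = await status.wait_for_connection(timeout=timeout)
except BaseException as exc:
    # Shield the cleanup so the already-delivered cancellation cannot
    # interrupt removing the request from the queue.
    with anyio.CancelScope(shield=True):
        async with self._pool_lock:
            self._requests.remove(status)
    raise exc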

So my suggested solution is to add flake8-trio to your CI system, and then we'll get to work on python-trio/flake8-async#61 so that the currently-trio-specific checks also handle anyio - and then it should 'just' be a matter of fixing lint warnings as they come up!

@PJ-Schulz

Hello,
I have found a similar problem with trio.Cancelled and catching BaseException. I want to make requests with httpx.AsyncClient and cancel them with a cancel scope after some time, for example:

with trio.move_on_after(5):
    response = await client.get(url)

This snippet is called regularly with the same client. After some time a PoolTimeout exception is thrown: the client can't get a connection from the connection pool because none of the 100 default connections are available.

What happened? If trio.Cancelled is raised while the request is being sent and we are waiting for the response, the connection ends up in an unusable state. At the point below, the trio.Cancelled exception is caught by the except BaseException clause, but the asynchronous code in the block is then never executed. The connection is not cleaned up, and after a while every connection in the pool is occupied.

except BaseException as exc:
    async with Trace("http11.response_closed", request) as trace:
        await self._response_closed()
    raise exc

A PoolTimeout then occurs when a new request is made, since no more connections are available.

This behavior can be shown in this example:

import trio

async def bar():
    await trio.sleep(0.5)
    print("World")

async def foo():
    try:
        print("Hello")
        await trio.sleep(2)
    except BaseException as ex:
        await bar()

async def main():
    with trio.move_on_after(1):
        await foo()
        
trio.run(main)
# >>> Hello

The example prints 'Hello' but not 'World'.
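For contrast, here is a sketch of the same example with the cleanup wrapped in a shielded cancel scope (an illustration of the shielding idea discussed above, not code taken from httpx or httpcore); this variant does print 'World':

import trio

async def bar():
    await trio.sleep(0.5)
    print("World")

async def foo():
    try:
        print("Hello")
        await trio.sleep(2)
    except BaseException:
        # The shield lets this cleanup checkpoint run even though the
        # surrounding scope has already been cancelled.
        with trio.CancelScope(shield=True):
            await bar()
        raise

async def main():
    with trio.move_on_after(1):
        await foo()

trio.run(main)
# >>> Hello
# >>> World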

@Zac-HD
Author

Zac-HD commented Mar 7, 2023

See also trio.abc.AsyncResource.aclose() and trio.aclose_forcefully() (anyio has the same functions, but less commentary in the docs):

IMPORTANT: This method may block in order to perform a “graceful” shutdown. But, if this fails, then it still must close any underlying resources before returning. An error from this method indicates a failure to achieve grace, not a failure to close the connection.

So self._response_closed() could be considered buggy, because if the task has been cancelled then it will raise when entering async with self._state_lock: and never call the aclose() method at all. I think this can be fixed by wrapping it in a shielded cancelscope, and maybe that could even go inside the method?

@PJ-Schulz

The problem with the shielded cancelscope is that when aclose() errors or hangs, the whole app gets stuck at that position.

Furthermore, if aclose() then needs, say, 2 seconds for whatever reason, the timeout I set to 5 seconds in my example effectively becomes 7 seconds.

So you need a timeout and the shielded cancelscope:

with trio.move_on_after(CLEANUP_TIMEOUT) as cleanup_scope:
    cleanup_scope.shield = True
    await aclose()

The question here is which value is sensible to choose for CLEANUP_TIMEOUT.
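Whatever value is chosen, the combined pattern can be exercised on its own; a rough self-contained sketch (with hypothetical names, not httpcore code) showing that a hung aclose() is bounded by CLEANUP_TIMEOUT instead of stalling the whole app:

import math
import trio

CLEANUP_TIMEOUT = 2  # an arbitrary value, just for illustration

async def hung_aclose():
    await trio.sleep(math.inf)  # simulate a cleanup that never completes

async def request():
    try:
        await trio.sleep(10)  # the "real" work; cancelled after 1 second
    except BaseException:
        # Shield the cleanup from the outer cancellation, but give it its
        # own deadline so a hung aclose() cannot block forever.
        with trio.move_on_after(CLEANUP_TIMEOUT) as cleanup_scope:
            cleanup_scope.shield = True
            await hung_aclose()
        raise

async def main():
    with trio.move_on_after(1):
        await request()
    print("done")  # reached after roughly 1 + CLEANUP_TIMEOUT seconds

trio.run(main)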

@Zac-HD
Author

Zac-HD commented Mar 8, 2023

In this case I'd probably go for fail_after(5) - pick any small number that seems to work - but I was wrong to suggest a shielded cancel scope above. The linter warning assumes that you always use async resources as context managers, rather than explicitly .aclose()ing them 😅

The proper solution is to ensure that if await aclose() is cancelled, it still closes the resource before control flow returns. In this case, that means returning the connection to the pool.
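A minimal sketch of that principle, using a stand-in resource rather than httpcore's actual connection class: the graceful part may be cancelled, but a synchronous finally block still releases the underlying resource before the cancellation propagates.

import trio

class Resource:
    closed = False

    async def aclose(self):
        try:
            await trio.sleep(1)  # "graceful" shutdown; may be cancelled
        finally:
            self.closed = True   # synchronous, so it runs even when cancelled

async def main():
    resource = Resource()
    with trio.move_on_after(0.1):
        await resource.aclose()
    print(resource.closed)  # True: closed despite the cancellation

trio.run(main)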

@stereobutter

I have not dug too deep into this, but maybe it is possible to split the connection-pool logic (i.e. whether or not to return the connection to the pool) from the async machinery? Also, when a BaseException like trio.Cancelled is encountered in

except BaseException as exc:
    async with Trace("http11.response_closed", request) as trace:
        await self._response_closed()
    raise exc

I believe the checks in

async def _response_closed(self) -> None:
    async with self._state_lock:
        if (
            self._h11_state.our_state is h11.DONE
            and self._h11_state.their_state is h11.DONE
        ):
            self._state = HTTPConnectionState.IDLE
            self._h11_state.start_next_cycle()
            if self._keepalive_expiry is not None:
                now = time.monotonic()
                self._expire_at = now + self._keepalive_expiry
        else:
            await self.aclose()

are not needed anyway, and the connection should always be closed?
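One hedged reading of that suggestion, sketched against the snippet above (assuming anyio is available in the module; this is not the fix that was actually merged): in the exception path, skip the keep-alive bookkeeping entirely and close unconditionally, under a shield so the close itself cannot be interrupted.

except BaseException as exc:
    # Always close on the error path; the h11 state checks only matter
    # when deciding whether the connection can be reused.
    with anyio.CancelScope(shield=True):
        await self.aclose()
    raise exc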

tomchristie added the bug (Something isn't working) label on May 19, 2023
tomchristie changed the title from "Teardown errors when tasks are cancelled or time out" to "Support async cancellations." on Jun 14, 2023
Zaczero added a commit to Zaczero/osm-relatify referencing this issue on Jun 20, 2023
@tomchristie
Member

Closed by #726
