
Ensure trio service shutdown follows DAG cancellation logic #12

Closed · pipermerriam opened this issue Dec 12, 2019 · 8 comments

@pipermerriam (Member)

What is wrong

See: https://github.com/ethereum/async-service/blob/04bbca368b0dfdeb87e8f9ff871ffb23a27b01ee/tests-asyncio/test_task_dag_cancellation_order.py

The asyncio service builds a DAG of tasks. The idea is that all of the tasks a service runs can be modeled as a DAG:

class MyService(Service):
    async def run(self):
        with open_socket() as socket:
            self.run_task(self.consume_socket, socket)
            await self.manager.wait_finished()

    async def consume_socket(self, socket):
        while True:
            try:
                do_thing_with_socket(socket)
            except CANCELLATION:  # stand-in for the framework's cancellation exception
                do_socket_cleanup(socket)

In the example above we have a DAG of run() -> consume_socket(). The consume_socket task needs the socket to be open. The run() method will close the socket when it gets cancelled.

If run() is cancelled before consume_socket(), then when consume_socket() gets cancelled it will try to run do_socket_cleanup, which will fail because the socket will have already been closed.

In order to address this we need to ensure that we unroll our task DAG by cancelling tasks from the leaves back towards the root.

How can it be fixed

The test for the asyncio version should be replicated in the trio version. Then, if trio doesn't do this for us automatically, we will need to do something like lazily spawning a new nursery for each task to house its sub-tasks and then doing the nursery cancellations in the proper order.
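
For illustration only, a rough sketch of that shape, assuming one lazily-created nursery per task; TaskNode and cancel_leaf_first are hypothetical names, not part of the library:

import trio

class TaskNode:
    # Illustrative bookkeeping, not the library's API: each task gets its own
    # lazily-created nursery plus a list of TaskNodes for its direct children.
    def __init__(self, nursery):
        self.nursery = nursery
        self.children = []

async def cancel_leaf_first(node):
    # Post-order walk: cancel every child subtree before cancelling this task,
    # so resources owned by a parent stay alive until its children have exited.
    # A real implementation would also have to wait for each child to actually
    # finish before cancelling its parent.
    for child in node.children:
        await cancel_leaf_first(child)
    node.nursery.cancel_scope.cancel()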

@gsalgado (Contributor)

After spending a whole day on this, I'm starting to think it might not make sense to have our trio implementation try to cancel tasks in any specific order, because that will be tricky to implement (as we don't handle task cancellation directly like in our asyncio implementation) but, more importantly, because it wouldn't prevent the very problem you describe in your example. I say that because I believe a more realistic example would have an async function performing the cleanup, like in the code below:

class MyService(Service):
    async def run(self):
        with open_socket() as socket:
            self.run_task(self.consume_socket, socket)
            await self.manager.wait_finished()

    async def consume_socket(self, socket):
        while True:
            try:
                await do_thing_with_socket(socket)
            except CANCELLATION:  # stand-in for the framework's cancellation exception
                await do_socket_cleanup(socket)

In the case above, even if we do explicitly cancel the consume_socket() task first, the await do_socket_cleanup(socket) will probably yield control and we'd end up cancelling the run() task, which could close the socket before the cleanup finished. In fact, I think this is what happens in the asyncio implementation as well: even though it attempts to wait for the cancelled task to finish, that wait seems to always raise a CancelledError immediately, so those tasks wouldn't perform any cleanup, not even if it were done synchronously.

Having said all that, it is late here and I've been staring at this for too long, so I may be missing something. But if nobody chips in to point out what, tomorrow I'll try to write a test to see whether the asyncio implementation prevents the problem you describe in the example above.

@pipermerriam (Member, Author)

Hope this helps. I think I have it working.

#22

So one thing I realized right at the end is that we don't actually need sub-nurseries for each child task, but rather just a trio.CancelScope for each child.

As for whether we need this or not: in your example above, specifically in the asyncio context, I believe it will work correctly because during cleanup we do an await task which will block until the except block has completed. As for trio, I'm less confident, because there is no built-in way to say "wait until the task is done".

If you take a look at my implementation you'll see I added a done event to the tracking. As we cancel each cancel scope, it waits for the task to fully exit before moving on to cancelling the next one.
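
A minimal sketch of that shape (illustrative names only, not the actual code from #22):

import trio

class ChildTask:
    # Illustrative per-child record: a CancelScope to cancel it with, and an
    # Event that is set once the task has fully exited.
    def __init__(self):
        self.cancel_scope = trio.CancelScope()
        self.done = trio.Event()

async def run_child(child, async_fn, *args):
    # Run the child inside its own cancel scope and signal completion on exit,
    # whether it finished normally or was cancelled.
    try:
        with child.cancel_scope:
            await async_fn(*args)
    finally:
        child.done.set()

async def cancel_children_in_order(children):
    # Cancel the most recently started children first and wait for each one to
    # fully exit before cancelling the next, so a parent's resources stay
    # usable until every task that depends on them is gone.
    for child in reversed(children):
        child.cancel_scope.cancel()
        await child.done.wait()

In a real manager this teardown loop would presumably also need to be shielded from outer cancellation so that the done.wait() calls are not themselves interrupted.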

I did, however, break something in the lifecycle that I'm trying to clean up...

@pipermerriam (Member, Author) commented Dec 16, 2019

I did a bit more work to get the test suite passing again. I think that the test_sub_service_cancelled_when_parent_stops test might be slightly incorrect as it exists now in master.

My goal for this library is to very concretely define the lifecycle of services, and the way the test is currently written suggests that the child service should register as being cancelled, which I'm not entirely sure is correct.

I've re-written the test so that it runs the child service as an actual child service, and I've updated the cancellation handling logic to include explicit cancellation of child services. I'm still not 100% confident that #22 represents everything exactly right, but I think it's close enough to iterate from.

So my questions:

  • How do you feel about the implementation in Piper/trio dag shutdown #22 and my explanation of why I think this feature is needed?
  • Assuming you are 👍 to move forward with this feature, do you want to take Piper/trio dag shutdown #22 and finish it off or would you prefer I do it?

@gsalgado (Contributor) commented Dec 17, 2019

As for whether we need this or not: in your example above, specifically in the asyncio context, I believe it will work correctly because during cleanup we do an await task which will block until the except block has completed.

That await task is the one I meant that returns immediately with a CancelledError.

Actually, that was the impression I had while debugging yesterday, but I couldn't confirm that with a unit test
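
For reference, a small standalone script (not part of the library; the helper names are made up here) that exercises exactly that question, i.e. whether awaiting a cancelled task lets its except-block cleanup run to completion:

import asyncio

async def worker(log):
    try:
        await asyncio.sleep(100)
    except asyncio.CancelledError:
        await asyncio.sleep(0.1)  # simulated async cleanup
        log.append("cleanup finished")
        raise

async def main():
    log = []
    task = asyncio.ensure_future(worker(log))
    await asyncio.sleep(0)  # let the worker reach its first await
    task.cancel()
    try:
        await task  # the disputed await
    except asyncio.CancelledError:
        pass
    print(log)  # ["cleanup finished"] only if the await really blocks for the cleanup

asyncio.run(main())

If the log comes out empty, that would support the "returns immediately" reading.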

@gsalgado (Contributor)

Trying to summarize things here...

The approach I tried in #21 requires accessing trio internals and doesn't ensure tasks are cancelled in the order we want.

Then I tried a simpler approach in #26, which waits for tasks in the order we want, but they can still be cancelled in any order. As I said there, I can't find a way to test the order of task termination, and even if I did, I'm not sure this approach would work as we expect because a task might terminate immediately when cancelled, before we have a chance to wait for all of its children.

Piper's approach in #22 ensures order of cancellation of tasks, but at the expense of us having to be completely in charge of task cancellation, which is far from ideal as that is not trivial stuff and likely to bite us in the long run, as we've seen with CancelToken.

Then there's the new test we've been discussing, which is supposed to verify that we don't let parents terminate before their children. All of the approaches above pass that test, but so does the current TrioManager implementation in master, so unless that test is wrong, we might not really need any of the changes above because trio's task cancellation mechanism is smart enough.

Oh, and all of the PRs above break trinity tests, so it'd be good to investigate that before merging any of them.

@pipermerriam (Member, Author)

Sorry for spreading conversation across multiple issues.

so unless that test is wrong,

I think #27 demonstrates that the test isn't testing the thing it's trying to test. I'm not 100% sure so feel free to poke holes in my testing logic.

I think we must do something like what I did in #22, because I'm pretty sure we have to control both the cancellation order and the termination order to be able to ensure the task lifecycle follows the expected rules.

@gsalgado (Contributor)

Those tests we're coming up with are almost as complicated as the code they test, and that's a big problem because it is very hard to find out whether or not they're testing the right thing. Unit tests are supposed to be simple and straightforward because they can't, by definition, be unit tested themselves and so need to be validated by humans. I don't mean to assign blame or anything, as I'm also guilty of writing overly complex tests, but after spending a significant part of the last few days debugging the tests themselves, I thought I'd remind us all of this.

About the task cancellation order, I'm afraid you're right that we may have to assume full control if we want to enforce a certain order, so I guess we can move forward with #22.

@pipermerriam (Member, Author)

Ok, this is being done in #31 (built on top of #29, which fixes the actual test).

mhchia added a commit to mhchia/py-libp2p that referenced this issue Dec 24, 2019
Probably it can solve the dag issue:
ethereum/async-service#12