
Ensure trio service shutdown follows DAG cancellation logic #12

Closed · pipermerriam opened this issue Dec 12, 2019 · 8 comments

@pipermerriam (Member)

What is wrong

See: https://github.com/ethereum/async-service/blob/04bbca368b0dfdeb87e8f9ff871ffb23a27b01ee/tests-asyncio/test_task_dag_cancellation_order.py

The asyncio service builds a DAG of tasks. The idea is that all of the tasks a service runs can be modeled as a DAG:

class MyService(Service):
    async def run(self):
        with open_socket() as socket:
            self.run_task(self.consume_socket, socket)
            await self.manager.wait_finished()

    async def consume_socket(self, socket):
        while True:
            try:
                do_thing_with_socket(socket)
            except CANCELLATION:  # stand-in for the framework's cancellation exception
                do_socket_cleanup(socket)

In the example above we have a DAG of run() -> consume_socket(). The consume_socket task needs the socket to be open. The run() method will close the socket when it gets cancelled.

If run() is cancelled before consume_socket(), then when consume_socket() gets cancelled it will try to run do_socket_cleanup, which will fail because the socket will have already been closed.

In order to address this we need to ensure that we unroll our task DAG by cancelling tasks from the leaves back towards the root.

How can it be fixed

The test for the asyncio version should be replicated in the trio version. Then, if trio doesn't do this for us automatically, we will need to do something like lazily spawning a new nursery for each task to house its sub-tasks and then doing the nursery cancellations in the proper order.
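
For illustration only, a rough sketch of that shape, assuming one lazily-created nursery per task; TaskNode and cancel_leaf_first are hypothetical names, not part of the library:

import trio

class TaskNode:
    # Illustrative bookkeeping, not the library's API: each task gets its own
    # lazily-created nursery plus a list of TaskNodes for its direct children.
    def __init__(self, nursery):
        self.nursery = nursery
        self.children = []

async def cancel_leaf_first(node):
    # Post-order walk: cancel every child subtree before cancelling this task,
    # so resources owned by a parent stay alive until its children have exited.
    # A real implementation would also have to wait for each child to actually
    # finish before cancelling its parent.
    for child in node.children:
        await cancel_leaf_first(child)
    node.nursery.cancel_scope.cancel()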

@gsalgado (Contributor)

After spending a whole day on this, I'm starting to think it might not make sense to have our trio implementation try to cancel tasks in any specific order, because that will be tricky to implement (as we don't handle task cancellation directly like in our asyncio implementation) but, more importantly, because it wouldn't prevent the very problem you describe in your example. I say that because I believe a more realistic example would have an async function performing the cleanup, like in the code below:

class MyService(Service):
    async def run(self):
        with open_socket() as socket:
            self.run_task(self.consume_socket, socket)
            await self.manager.wait_finished()

    async def consume_socket(self, socket):
        while True:
            try:
                await do_thing_with_socket(socket)
            except CANCELLATION:  # stand-in for the framework's cancellation exception
                await do_socket_cleanup(socket)

In the case above, even if we do explicitly cancel the consume_socket() task first, the await do_socket_cleanup(socket) will probably yield control and we'd end up cancelling the run() task, which could close the socket before the cleanup finished. In fact, I think this is what happens in the asyncio implementation as well: even though it attempts to wait for the cancelled task to finish, that wait seems to always raise a CancelledError immediately, so those tasks wouldn't perform any cleanup, not even if it were done synchronously.

Having said all that, it is late here and I've been staring at this for too long, so I may be missing something. But if nobody chips in to point out what, tomorrow I'll try to write a test to see whether the asyncio implementation prevents the problem you describe in the example above.

@pipermerriam (Member, Author)

Hope this helps. I think I have it working.

#22

So one thing I realized right at the end is that we don't actually need sub-nurseries for each child task, but rather just a trio.CancelScope for each child.

As for whether we need this or not: in your example above, specifically in the asyncio context, I believe it will work correctly because during cleanup we do an await task which will block until the except block has completed. As for trio, I'm less confident, because there is no built-in way to say "wait until the task is done".

If you take a look at my implementation you'll see I added a done event to the tracking. As we cancel each cancel scope, it waits for the task to fully exit before moving on to cancelling the next one.
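
A minimal sketch of that shape (illustrative names only, not the actual code from #22):

import trio

class ChildTask:
    # Illustrative per-child record: a CancelScope to cancel it with, and an
    # Event that is set once the task has fully exited.
    def __init__(self):
        self.cancel_scope = trio.CancelScope()
        self.done = trio.Event()

async def run_child(child, async_fn, *args):
    # Run the child inside its own cancel scope and signal completion on exit,
    # whether it finished normally or was cancelled.
    try:
        with child.cancel_scope:
            await async_fn(*args)
    finally:
        child.done.set()

async def cancel_children_in_order(children):
    # Cancel the most recently started children first and wait for each one to
    # fully exit before cancelling the next, so a parent's resources stay
    # usable until every task that depends on them is gone.
    for child in reversed(children):
        child.cancel_scope.cancel()
        await child.done.wait()

In a real manager this teardown loop would presumably also need to be shielded from outer cancellation so that the done.wait() calls are not themselves interrupted.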

I did, however, break something in the lifecycle that I'm trying to clean up...

@pipermerriam (Member, Author) commented Dec 16, 2019

I did a bit more work to get the test suite passing again. I think that the test_sub_service_cancelled_when_parent_stops test might be slightly incorrect as it exists now in master.

My goal for this library is to very concretely define the lifecycle of services, and the way the test is currently written suggests that the child service should register as being cancelled, which I'm not entirely sure is correct.

I've re-written the test so that it runs the child service as an actual child service, and I've updated the cancellation handling logic to include explicit cancellation of child services. I'm still not 100% confident that #22 represents everything exactly right, but I think it's close enough to iterate from.

So my questions:

  • How do you feel about the implementation in Piper/trio dag shutdown #22 and my explanation of why I think this feature is needed?
  • Assuming you are 👍 to move forward with this feature, do you want to take Piper/trio dag shutdown #22 and finish it off or would you prefer I do it?

@gsalgado (Contributor) commented Dec 17, 2019

As for whether we need this or not: in your example above, specifically in the asyncio context, I believe it will work correctly because during cleanup we do an await task which will block until the except block has completed.

That await task is the one I meant that returns immediately with a CancelledError.

Actually, that was the impression I had while debugging yesterday, but I couldn't confirm that with a unit test
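
For reference, a small standalone script (not part of the library; the helper names are made up here) that exercises exactly that question, i.e. whether awaiting a cancelled task lets its except-block cleanup run to completion:

import asyncio

async def worker(log):
    try:
        await asyncio.sleep(100)
    except asyncio.CancelledError:
        await asyncio.sleep(0.1)  # simulated async cleanup
        log.append("cleanup finished")
        raise

async def main():
    log = []
    task = asyncio.ensure_future(worker(log))
    await asyncio.sleep(0)  # let the worker reach its first await
    task.cancel()
    try:
        await task  # the disputed await
    except asyncio.CancelledError:
        pass
    print(log)  # ["cleanup finished"] only if the await really blocks for the cleanup

asyncio.run(main())

If the log comes out empty, that would support the "returns immediately" reading.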

@gsalgado (Contributor)

Trying to summarize things here...

The approach I tried in #21 requires accessing trio internals and doesn't ensure tasks are cancelled in the order we want.

Then I tried a simpler approach in #26, which waits for tasks in the order we want, but they can still be cancelled in any order. As I said there, I can't find a way to test the order of task termination, and even if I did, I'm not sure this approach would work as we expect because a task might terminate immediately when cancelled, before we have a chance to wait for all of its children.

Piper's approach in #22 ensures order of cancellation of tasks, but at the expense of us having to be completely in charge of task cancellation, which is far from ideal as that is not trivial stuff and likely to bite us in the long run, as we've seen with CancelToken.

Then there's the new test we've been discussing, which is supposed to verify that we don't let parents terminate before their children. All of the approaches above pass that test, but so does the current TrioManager implementation in master, so unless that test is wrong, we might not really need any of the changes above because trio's task cancellation mechanism is smart enough.

Oh, and all of the PRs above break trinity tests, so it'd be good to investigate that before merging any of them.

@pipermerriam (Member, Author)

Sorry for spreading conversation across multiple issues.

so unless that test is wrong,

I think #27 demonstrates that the test isn't testing the thing it's trying to test. I'm not 100% sure so feel free to poke holes in my testing logic.

I think we must do something like what I did in #22, because I'm pretty sure we have to control both the cancellation order and the termination order to be able to ensure the task lifecycle follows the expected rules.

@gsalgado (Contributor)

Those tests we're coming up with are almost as complicated as the code they test, and that's a big problem because it is very hard to find out whether or not they're testing the right thing. Unit tests are supposed to be simple and straightforward because they can't, by definition, be unit tested themselves and so need to be validated by humans. I don't mean to assign blame or anything, as I'm also guilty of writing overly complex tests, but after spending a significant part of the last few days debugging the tests themselves, I thought I'd remind us all of this.

About the task cancellation order, I'm afraid you're right that we may have to assume full control if we want to enforce a certain order, so I guess we can move forward with #22.

@pipermerriam (Member, Author)

Ok, this is being done in #31 (built on top of #29, which fixes the actual test).

mhchia added a commit to mhchia/py-libp2p that referenced this issue Dec 24, 2019
Probably it can solve the dag issue:
ethereum/async-service#12