Potential race condition in Lock #387

Closed
Tinche opened this issue Nov 12, 2021 · 9 comments · Fixed by #388
Labels: bug (Something isn't working)

Comments

Tinche commented Nov 12, 2021

Hello!

I'm trying to debug an issue we sometimes hit in production through httpx. We're not using anyio ourselves directly (it's regular asyncio), but httpx is based on it.

I'm not 100% sure the scenario I'm documenting here is the exact issue we're seeing, but there's a good chance it is. The symptom is that a lock instance gets stuck in the locked state with a finished task still set as its owner.

Here's the rough sketch, involving two tasks:

Task 1:

  • acquires the lock

Task 2:

  • tries acquiring the lock, fails, creates an event and starts waiting on it

Task 1:

  • cancels Task 2, then releases the lock (which sets the lock's _owner_task to Task 2), retrieves Task 2's event and sets it

Task 2:

  • wakes up with CancelledError on event.wait(), propagates it

The lock's _owner_task is still set to Task 2, leaving the lock stuck.

In our production code it's aiohttp cancelling Task 2 rather than Task 1, but that's not relevant to the example.

Here's a Python snippet demonstrating the problem:

from asyncio import CancelledError, create_task, run, sleep
from contextlib import suppress

from anyio import Lock


async def main():
    lock = Lock()

    async def task_b():
        await sleep(0.1)  # Sleep to allow task_a to set up
        async with lock:
            pass

    t = create_task(task_b())

    async def task_a():
        async with lock:
            await sleep(0.1)  # Sleep to allow task_b to try locking
            t.cancel()
        await sleep(0.2)
        print(t.done())  # True
        print(lock._owner_task)  # TaskInfo(id=4337401360, name='Task-2')

    await task_a()
    with suppress(CancelledError):
        await t


run(main())

The lock should end up in a free state, but it gets stuck being owned by task_b.

agronholm (Owner) commented:

What seems to happen here is:

  1. Task A acquires the lock and goes to sleep
  2. Task B awakens from sleep, tries to acquire the lock, and starts waiting on an event because it could not acquire the lock
  3. Task A schedules a cancellation for task B
  4. Task A releases the lock, transferring ownership to task B
  5. Task B is rescheduled with CancelledError raised in await event.wait()
  6. Task B tries to clean up, but because it is no longer waiting on the lock (it owns the lock now), the cleanup block does nothing

Adding a new check to the cleanup block that calls release() when the lock is owned by the current task seems to solve the issue. I'll make a PR for this.
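
For illustration, here is a minimal, self-contained sketch of that kind of cleanup check, written against plain asyncio rather than anyio's actual internals. Apart from _owner_task, which appears in the reproduction above, the names (SketchLock, _waiters, the (task, event) tokens) are assumptions made for this example.

```python
# Illustrative sketch only; this is not anyio's implementation.
from asyncio import CancelledError, Event, current_task
from collections import deque


class SketchLock:
    def __init__(self):
        self._owner_task = None
        self._waiters = deque()  # queue of (task, event) pairs (assumed structure)

    async def acquire(self):
        if self._owner_task is None:
            self._owner_task = current_task()
            return

        event = Event()
        token = (current_task(), event)
        self._waiters.append(token)
        try:
            await event.wait()
        except CancelledError:
            if token in self._waiters:
                # Still queued: simply withdraw from the wait queue.
                self._waiters.remove(token)
            elif self._owner_task is current_task():
                # Ownership was already handed over before the cancellation
                # was delivered, so give the lock back (the new check).
                self.release()
            raise

    def release(self):
        if self._waiters:
            # Hand ownership directly to the next waiter and wake it up.
            self._owner_task, event = self._waiters.popleft()
            event.set()
        else:
            self._owner_task = None
```

The important branch is the elif: by the time the cancellation is delivered, release() has already handed ownership to the cancelled waiter, so the only safe recovery is for that waiter to release the lock itself before propagating the CancelledError.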

agronholm added the bug label on Nov 13, 2021
agronholm (Owner) commented:

As a side note, the Semaphore class may have a similar problem.
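
Assuming Semaphore hands its slot to the next waiter on release the same way Lock hands over ownership, the reproduction from this issue adapts to it directly. The expected output noted in the comments is an assumption for illustration, not something verified in this thread:

```python
from asyncio import CancelledError, create_task, run, sleep
from contextlib import suppress

from anyio import Semaphore


async def main():
    sem = Semaphore(1)

    async def task_b():
        await sleep(0.1)  # Sleep to allow task_a to set up
        async with sem:
            pass

    t = create_task(task_b())

    async def task_a():
        async with sem:
            await sleep(0.1)  # Sleep to allow task_b to start waiting
            t.cancel()
        await sleep(0.2)
        print(t.done())   # True
        print(sem.value)  # Expected 1; an affected version is assumed to report 0

    await task_a()
    with suppress(CancelledError):
        await t


run(main())
```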

Tinche commented Nov 13, 2021

I think that's exactly it. Except in my case it's not task A doing the cancelling but something else, but that's immaterial here.

agronholm (Owner) commented:

> I think that's exactly it. Except in my case it's not task A doing the cancelling but something else, but that's immaterial here.

Task A contains this code: t.cancel()
Therefore it is task A that is doing the cancellation.

Tinche commented Nov 13, 2021

> > I think that's exactly it. Except in my case it's not task A doing the cancelling but something else, but that's immaterial here.
>
> Task A contains this code: t.cancel() Therefore it is task A that is doing the cancellation.

In the example I posted, yes. In the incidents on our production environment, no.

agronholm (Owner) commented:

I see. Anyway, there is a PR with the appropriate regression tests fixing the problem for both Lock and Semaphore, so as soon as I can get a review, I will merge it.
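
The PR itself isn't shown in this thread, but a regression test for the Lock case could look roughly like the sketch below, which simply wraps the reproduction from this issue in a test function. The test name is made up, and pytest with pytest-asyncio is assumed here because the reproduction drives plain asyncio tasks:

```python
from asyncio import CancelledError, create_task, sleep
from contextlib import suppress

import pytest
from anyio import Lock


@pytest.mark.asyncio
async def test_lock_released_when_cancelled_waiter_becomes_owner():
    lock = Lock()

    async def waiter():
        async with lock:
            pass

    async with lock:
        task = create_task(waiter())
        await sleep(0.1)  # let the waiter block on the lock
        task.cancel()     # the cancellation now races with the ownership hand-off
    # Leaving the block released the lock, handing ownership to the cancelled waiter.
    await sleep(0.2)

    with suppress(CancelledError):
        await task

    # With the fix, the cancelled waiter gives the lock back on its way out.
    assert not lock.locked()
```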

Tinche commented Nov 13, 2021

Thank you!

Tinche commented Nov 16, 2021

Thanks for the quick response on this!

agronholm (Owner) commented:

I will cut a release once the contextvar propagation issue is solved (needs a PR that is almost ready).

agronholm added a commit that referenced this issue Nov 17, 2021
agronholm added a commit that referenced this issue Nov 17, 2021
agronholm added a commit that referenced this issue Nov 21, 2021
> It was decided to add workarounds for trio rather than wait for the relevant PR to be merged and a new version released.
>
> Fixes #387.