libuv check watcher infinite loop #1126

Closed
jamadden opened this Issue Mar 1, 2018 · 8 comments

Comments

@jamadden
Member

jamadden commented Mar 1, 2018

Running CPython 3.7b2 on macOS, I can sometimes produce an infinite loop where a single check watcher gets called over and over again. python -m greentest.monkey_test test_selectors.py PollSelectorTestCase.test_timeout will reliably produce this (with 'test_main()' replaced with 'unittest.main()' so we can run a single test).

That test creates a number of poll watchers and a single check watcher (a Timeout with time of 0, which we turn into a check watcher to make the loop work).
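For anyone not steeped in libuv terminology: a check watcher's callback runs once per loop iteration, right after the poll phase, which is why a zero-duration timeout can be expressed as one. A minimal illustration in plain libuv C (not gevent's actual code; the names here are made up for the example):

#include <uv.h>

/* A check watcher fires once per loop iteration, immediately after
 * polling, until uv_check_stop() is called -- effectively a timer
 * with a timeout of 0. */
static void on_check(uv_check_t* handle) {
  /* do the "timed out immediately" work here */
}

static void start_zero_timeout(uv_loop_t* loop, uv_check_t* check) {
  uv_check_init(loop, check);
  uv_check_start(check, on_check);
}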

Here's some debugging output from a modified libuv with comments:

Entering run for check # begin uv__run_check
Calling watcher check 0x7ff67d4bbe30 # Our main callback watcher
Stopping for check 0x7ff67d4bbe30; is active? 1
Stopping for check 0x7ff67d4be9d0; is active? 1
UV_CHECK_CLOSE 0x7ff67d4be9d0 # Yep, stopped *and* closed
Stopping for check 0x7ff67d4be9d0; is active? 0
Starting for check 0x7ff67d4e91b0; is active? 0 # WATCH THIS ONE: this is the timer 0
Calling watcher check 0x7ff67d4be9d0 # WAIT, WHOA: You were already stopped!
Calling watcher check 0x7ff67d4bb780
Exiting run for check # Exit that check
Entering run for idle
Exiting run for idle
Entering run for prepare
Calling watcher prepare 0x7ff67d4bb900
Stopping for check 0x7ff67d4bbe30; is active? 0
Exiting run for prepare
Starting for check 0x7ff67d4bbe30; is active? 0

Entering run for check  # Begin another uv__run_check
Calling watcher check 0x7ff67d4bbe30 # main callback again
Stopping for check 0x7ff67d4bbe30; is active? 1
Stopping for check 0x7ff67d4e91b0; is active? 1 # We stop it
UV_CHECK_CLOSE 0x7ff67d4e91b0 # And we close it
Stopping for check 0x7ff67d4e91b0; is active? 0
Starting for check 0x7ff67d4be9d0; is active? 0
Calling watcher check 0x7ff67d4e91b0
WARNING: gevent: Unable to dereference handle; not stopping watcher. Native resources may leak. This is most likely a bug in gevent.
Calling watcher check 0x7ff67d4be9d0
Stopping for check 0x7ff67d4be9d0; is active? 1
UV_CHECK_CLOSE 0x7ff67d4be9d0
Stopping for check 0x7ff67d4be9d0; is active? 0
Starting for check 0x7ff67d4bc330; is active? 0
Calling watcher check 0x7ff67d4e91b0 # But wait, it's called again!
WARNING: gevent: Unable to dereference handle; not stopping watcher. Native resources may leak. This is most likely a bug in gevent.
Calling watcher check 0x7ff67d4e91b0 # And again!
WARNING: gevent: Unable to dereference handle; not stopping watcher. Native resources may leak. This is most likely a bug in gevent.
Calling watcher check 0x7ff67d4e91b0
WARNING: gevent: Unable to dereference handle; not stopping watcher. Native resources may leak. This is most likely a bug in gevent

It appears that the check watcher is in the loop's check queue as the last entry, and that entry is somehow circular. This makes no sense because we stopped (and closed) the watcher and we never call uv_check_start on it again. Note that we never actually exit uv__run_check.
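(For context, libuv's queues are intrusive, circular, doubly-linked lists. Roughly -- this is a sketch, the real thing is a typedef void *QUEUE[2]; plus macros in src/queue.h:)

/* Sketch of libuv's intrusive queue node. Every handle embeds one,
 * and the loop holds the list head. */
struct queue_node {
  struct queue_node* next;
  struct queue_node* prev;
};

/* The list is circular: an empty head points back at itself. A stale
 * entry whose pointers still (somehow) lead back to itself would be
 * dequeued and "run" forever, which is what the trace above looks like. */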

I can patch the libuv source to avoid this issue (make it call uv__is_active() before invoking the callback), but I'd like to understand how this is happening in the first place. So far I haven't been able to reproduce it in small standalone C examples.
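For reference, the workaround looks roughly like this. It is a sketch against the (macro-generated) uv__run_check in src/unix/loop-watcher.c, not the literal patch, and it relies on libuv's internal QUEUE macros:

/* Simplified uv__run_check with the proposed guard: skip any handle
 * that was stopped while this pass was already in progress. */
static void uv__run_check_with_guard(uv_loop_t* loop) {
  QUEUE queue;
  QUEUE* q;
  uv_check_t* h;

  QUEUE_MOVE(&loop->check_handles, &queue);
  while (!QUEUE_EMPTY(&queue)) {
    q = QUEUE_HEAD(&queue);
    h = QUEUE_DATA(q, uv_check_t, queue);
    QUEUE_REMOVE(q);
    QUEUE_INSERT_TAIL(&loop->check_handles, q);
    if (uv__is_active((uv_handle_t*) h))  /* the added guard */
      h->check_cb(h);
  }
}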

@jamadden jamadden added bug libuv labels Mar 1, 2018

@jamadden

Member

jamadden commented Mar 1, 2018

This appears to be a heisenbug: running under the debugger didn't exhibit the problem.

I see the issue even when everything is just compiled with -g and no optimizations.

@jamadden

Member

jamadden commented Mar 1, 2018

I can reproduce this in Python 3.6.4 with CFFI 1.11.[25].

@jamadden

Member

jamadden commented Mar 1, 2018

I can reproduce this on Python 3.5.2 on Linux 4.4.0 and cffi 1.11.5. My standalone C program still does not produce it.

@jamadden

Member

jamadden commented Mar 1, 2018

[EDIT: This is a bunch of red herrings. It's true that the way CFFI/CPython allocate memory exacerbates this situation by hiding it---if I clear the memory when the close callback happens, things crash much sooner---but it's not directly at fault.]

This seems to be related to the way CFFI/CPython manage memory for the libuv watcher (handle) structs. We allocate those with a call to ffi.new(), and when we close them, we invoke uv_close(). The contract of uv_close is that it will call a callback when it's really safe to dispose of the memory for the watcher---this is done at the end of every loop iteration. Disposing of the memory before then is liable to lead to a crash.
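(A minimal illustration of that contract in plain C, with made-up names, just to make the ownership rule explicit:)

#include <stdlib.h>
#include <uv.h>

static void on_close(uv_handle_t* handle) {
  free(handle);                  /* only now is it safe to release */
}

static void dispose_watcher(uv_check_t* watcher) {
  uv_check_stop(watcher);
  uv_close((uv_handle_t*) watcher, on_close);
  /* do NOT free(watcher) here; the loop may still touch the struct
   * until on_close fires at the end of a loop iteration. */
}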

"Disposing of the memory" in CFFI means deleting the last reference to the object, so we artificially put the watcher pointers into a data structure so that they stay alive. When close callback is invoked, we remove the pointer from the data structure. This causes CFFI to "free" the memory.

If I stop doing that, "leaking" the memory, the infinite corruption stops happening. So the (very handwavy) hypothesis is that when we allocate a new watcher object, it's re-using old memory, and either that memory is not zeroed or old pointers still point to that memory (somehow), so when we start it and add it to the tail of the queue, it points to the old data for the old "dead" watcher, leading to this cycle.

That's not a perfect hypothesis. I still have a stopped watcher getting called every time when it should not be, and I can't explain how it's getting back into the watcher queue.

@jamadden

Member

jamadden commented Mar 1, 2018

This simplified script will reproduce a crash 4 out of 5 times:

from gevent import config
config.loop = 'libuv'

from gevent import select


poll = select.poll()
# 1 is stdout, we can immediately always write to it.
poll.register(1, select.POLLOUT)

# So the OS poll will return each time before the check-watcher timeout happens
print("****FIRST POLL")
poll.poll(-1)
print("****SECOND POLL")
poll.poll(-1)
print("****THIRD POLL")
poll.poll(-1) # BANG!

print("Done")

This crashes on CPython 3.7, 3.6 and 2.7, but does not crash on PyPy2 5.10 (but see below). So that continues to point the finger at something to do with memory management.

I said it doesn't crash in PyPy. I've now had it crash there once, so it just crashes much less frequently. It also produces the "WARNING: unable to dereference handle" message much less often. If I insert gc.collect() calls between the polls, PyPy begins crashing much more often.

@jamadden

Member

jamadden commented Mar 1, 2018

I added some greenlet tracing, to go along with my libuv tracing, and now I think I see the problem, and it's a doozy.

Here's the output of the plain C program that functions correctly. It uses check watchers in the same way the malfunctioning Python program does, so it should behave similarly (I've replaced pointer addresses with symbolic names):

--> Entering run for check number 3
Begin local queue is now CHECK0_NEXT and CHECK2_NEXT
	RunQueue next CHECK0 queue CHECK0_NEXT
	RunQueue next CHECK1 queue CHECK1_NEXT
	RunQueue next CHECK2 queue CHECK2_NEXT
Local queue is now CHECK1_NEXT and CHECK2_NEXT # (1)
----> Calling watcher check CHECK0
Stopping for check CHECK0; is active? 1
Handle CHECK0 has queue 0 0x100531ad0 and 1 0x100531ad0
Stopping for check CHECK1; is active? 1
Handle CHECK1 has queue 0 CHECK2_NEXT and 1 0x7ffeef6f1a90
...
<---- Called  watcher check CHECK0
Begin local queue is now CHECK2_NEXT and CHECK2_NEXT # (2)
	RunQueue next CHECK2 queue CHECK2_NEXT
...
<-- Exiting run for check

This is the trace of the uv__run_check function. The lines marked (1) and (2) are the key. The line marked (1) is printed before we call the first check callback. That first check callback goes on to stop itself and the next callback (CHECK0 and CHECK1), both of which were already scheduled to be run. When that callback returns, we can see at the line marked (2) that the local queue of watchers to run reflects this (i.e., (1) != (2)). The only thing left in the queue is the one remaining original watcher.

Now here's the same trace from the gevent/greenlet version (the script given above):

--> Entering run for check number 3
Begin local queue is now CHECK0_NEXT and CHECK2_NEXT
	RunQueue next CHECK0 queue CHECK0_NEXT
	RunQueue next CHECK1 queue CHECK1_NEXT
	RunQueue next CHECK2 queue CHECK2_NEXT
Local queue is now CHECK1_NEXT and CHECK2_NEXT (1)
----> Calling watcher check CHECK0
Stopping for check CHECK0; is active? 1
Handle CHECK0 has queue 0 0x106a2aca8 and 1 0x106a2aca8
&&&GREENLET switch (<Hub at 0x106b60508 default default pending=0 ref=0 fileno=5>, <greenlet.greenlet object at MAIN>)
Stopping for check CHECK1; is active? 1
Handle CHECK1 has queue 0 CHECK2_NEXT and 1 0x7ffee9cb1ab0
UV_CHECK_CLOSE CHECK1
Stopping for check CHECK1; is active? 0
****SECOND POLL
...
&&&GREENLET switch (<greenlet.greenlet object at MAIN>, <Hub at 0x106b60508 default default pending=0 ref=0 fileno=5>)
<---- Called  watcher check CHECK0
Begin local queue is now CHECK1_NEXT and CHECK2_NEXT (2)
	RunQueue next CHECK1 queue CHECK1_NEXT
	RunQueue next CHECK2 queue CHECK2_NEXT

Somewhere above, we switched from the main greenlet into the hub. While we're running the CHECK0 callback, we switch from the hub back into the main greenlet. We proceed to stop CHECK1, carry on running the Python program (printing "****SECOND POLL") and then switch back to the hub greenlet to continue running in uv__run_check for the next callback.

Now compare lines (1) and (2). We would expect them to be different because we have modified the run queue by stopping CHECK1...but they're not! They're the same! And CHECK1 is still scheduled to run next, even though we stopped it!

How could this be?

Queues in libuv are very clever. (They remind me of what used to be called "direct threaded" interpreters.) The run queue in uv__run_check is simply two local void* variables on the stack. As we run through the queue, removing elements from it as we run each of them, we temporarily have pointers to this stack-allocated object stored into the heap-allocated watcher objects. If one of them is stopped, it is removed from this queue, and by doing so it makes modifications to the two variables on the stack.

When we switch a greenlet out, greenlet helpfully saves the contents of the C stack for us, and when we switch back in, greenlet restores them. The modifications that stopping the watcher made to the stack-allocated queue are thus completely lost. (Worse, depending on how greenlet swaps stacks, they might just be written to some random memory location?)

If I move the local QUEUE variable off the stack by changing libuv (roughly the change sketched below), things seem to work. (Mostly; it turns out that putting your terminal into non-blocking mode by trying to poll on it isn't necessarily the best idea :) )
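The shape of that change is roughly the following sketch (not the actual patch; check_run_queue is a hypothetical field added to uv_loop_t). Because the temporary run queue now lives in the heap-allocated loop structure rather than on the C stack, a QUEUE_REMOVE done while the callback's greenlet has switched away edits memory that the stack swap neither saves nor restores, so the removal is still visible when uv__run_check resumes:

static void uv__run_check_off_stack(uv_loop_t* loop) {
  QUEUE* q;
  uv_check_t* h;

  /* Hypothetical field on the loop instead of a QUEUE local. */
  QUEUE_MOVE(&loop->check_handles, &loop->check_run_queue);
  while (!QUEUE_EMPTY(&loop->check_run_queue)) {
    q = QUEUE_HEAD(&loop->check_run_queue);
    h = QUEUE_DATA(q, uv_check_t, queue);
    QUEUE_REMOVE(q);
    QUEUE_INSERT_TAIL(&loop->check_handles, q);
    h->check_cb(h);
  }
}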

If anyone can confirm or dispute this, I'd appreciate it. Right now my head hurts and I'm going to go sob quietly to myself for a while.

@jamadden

Member

jamadden commented Mar 1, 2018

Windows doesn't seem to suffer from this issue. loop-watcher.c on Windows uses a different queue mechanism, not based on the stack. That mechanism is more like libev's, where the run queue is kept in the loop itself.

@jamadden

Member

jamadden commented Mar 2, 2018

I have patched the libuv source to fix this for us. I need to contact the libuv mailing list to see if they have any interest in accepting such a change upstream.

jamadden added a commit that referenced this issue Mar 2, 2018

jamadden added a commit that referenced this issue Mar 2, 2018

@jamadden jamadden closed this in 9bc5da8 Mar 2, 2018

jamadden added a commit that referenced this issue Mar 2, 2018

jamadden added a commit that referenced this issue Mar 3, 2018
