v4b1: PubSub receive cleanup hang #319
Hi @bbrowning918 @acu192 — Just a quick update. I've pushed a few cleanup PRs, and Actions seem to be going well... 🤔 https://github.com/django/channels_redis/actions

I think it's worth rebasing efforts to make sure we're not just hitting using-old-stuff issues. I'll note we've got one warning still coming out of redis-py on PY310 (it shows up in the pytest `warnings summary`).

Locally, I'm hitting the freeze maybe one time in two (or so) with the reduced test case here, so will dig into that next.
OK, so yes. When it stalls we just get stuck in that ...
Hi @Andrew-Chen-Wang — I don't know if you have any bandwidth at the moment (if not, no problem, sorry for the noise 🎁) but would you maybe be able to glance at this, and the related discussion on #317, just to see if something jumps out at you about the redis usage after the migration to redis-py? Thanks 🙏
Hi all - hope you're well; figured I'd pop my head in since I had some free time and see if I could lend a hand. This jumped out as something interesting to investigate, and I can't quite make heads or tails of it after a few minutes of poking about. But I had a feeling it was something to do with the async-timeout package, and a quick look at their repo led me to this old issue, which has repro code that looks suspiciously similar to some of our patterns: aio-libs/async-timeout#229 (comment)

Anyway, will take another look tomorrow when I have more time.
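If I'm reading aio-libs/async-timeout#229 right, the hazard looks roughly like this (a contrived sketch, not their repro; the timings are made up purely to line the cancel up with the deadline):

```python
import asyncio
import contextlib
import async_timeout

async def reader():
    while True:
        try:
            # Stand-in for a blocking pubsub read bounded by a deadline.
            async with async_timeout.timeout(1.0):
                await asyncio.sleep(10)
        except asyncio.TimeoutError:
            continue  # deadline hit: loop around and wait again

async def main():
    task = asyncio.create_task(reader())
    await asyncio.sleep(1.0)  # line the cancel up with the deadline
    task.cancel()
    # async-timeout implements its deadline *via* cancellation, so an
    # external cancel landing in that same window can be mistaken for
    # the deadline and surfaced as TimeoutError, which reader() then
    # swallows. The task loops instead of exiting, and this await can
    # hang: the same symptom as our cleanup path.
    with contextlib.suppress(asyncio.CancelledError):
        await task

asyncio.run(main())
```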
@bbrowning918 can you please have a look at this branch here: zumalabs#11. I'm not entirely sure what the issue is, but the test still highlighted a few improvements nonetheless.
Hmm, locally it seems to show that both ... I went down the rabbit hole from the ...
I've found at least a hacky workaround:
Bumping the ... Under a tight window while we're doing our cleanup work, I believe the keepalive kicks off another ... There is quite a large docstring on ...
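For what it's worth, the race in miniature as I understand it (every name here is a hypothetical stand-in, not the real attribute):

```python
import asyncio
import contextlib

class Shard:
    """Toy model of the race; every attribute name here is made up."""

    def __init__(self):
        self._conn = object()  # stands in for the subscriber connection
        self._keepalive_task = asyncio.create_task(self._keepalive())

    async def _keepalive(self):
        while True:
            await asyncio.sleep(0.01)
            if self._conn is None:
                # Liveness check fires inside the cleanup window and
                # quietly brings the shard back to life.
                self._conn = object()

    async def flush(self):
        self._conn = None          # cleanup drops the connection...
        await asyncio.sleep(0.05)  # ...but nothing stops the keepalive

async def main():
    shard = Shard()
    await shard.flush()
    print("conn after flush:", shard._conn)  # not None: flush was undone
    shard._keepalive_task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await shard._keepalive_task

asyncio.run(main())
```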
Nice work @bbrowning918. 🕵️♀️ (Current status: not sure — once I get the Channels update ready I will swing back here for a play. Happy if you want to make suggestions!)
@acu192 — Do you have half a cycle to look at the discussion here and see what you think? (Thanks)
I'll play around a bit this weekend. It seems that in the shard flush we need to take the lock to prevent the keepalive from bringing the shard back to life.
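Concretely, the shape I'd expect, extending the toy model above (again, names are illustrative, not the real code):

```python
import asyncio
import contextlib

class Shard:
    """Same toy model, with flush and keepalive serialized by one lock."""

    def __init__(self):
        self._conn = object()
        self._lock = asyncio.Lock()
        self._keepalive_task = asyncio.create_task(self._keepalive())

    async def _keepalive(self):
        while True:
            await asyncio.sleep(0.01)
            async with self._lock:  # keepalive now respects the lock
                if self._conn is None:
                    self._conn = object()

    async def flush(self):
        async with self._lock:             # excludes the keepalive...
            self._keepalive_task.cancel()  # ...and stops it for good
            self._conn = None
        with contextlib.suppress(asyncio.CancelledError):
            await self._keepalive_task

async def main():
    shard = Shard()
    await shard.flush()
    await asyncio.sleep(0.05)
    print("conn after flush:", shard._conn)  # stays None

asyncio.run(main())
```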
Ok, great @qeternity. If we can get a solution here in the next week or so that would be great, otherwise I'll push the release anyway, and we'll have to fix it later. 😜
So there was quite a bit of cruft in the old aioredis logic around marshaling raw redis connections and keepalives. Using redis-py pools, we get built-in keepalives by using a low timeout on the subscriber connection, which will auto-reconnect and resubscribe. I've opened a quick refactor (#326) of the pubsub shard which resolves all the hangs and cleans up the code a bit. I can't find a way to implement the disconnect/reconnect notifiers under redis-py, however.
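For reference, the low-timeout subscriber pattern looks roughly like this under redis-py's asyncio API (a sketch, not the actual #326 code; the URL and channel name are placeholders):

```python
import asyncio
import redis.asyncio as aioredis

async def subscriber(url: str, channel: str):
    client = aioredis.Redis.from_url(url)
    pubsub = client.pubsub()
    await pubsub.subscribe(channel)
    while True:
        # A short timeout means we never block forever on a dead socket;
        # each pass through the loop gives redis-py a chance to notice a
        # dropped connection and reconnect/resubscribe on the next read.
        message = await pubsub.get_message(
            ignore_subscribe_messages=True, timeout=1.0
        )
        if message is not None:
            print(message["data"])

asyncio.run(subscriber("redis://localhost:6379", "test-channel"))
```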
There looks to be some desirable code in redis-py that hasn't been released yet, specifically pertaining to auto-reconnecting in pubsub blocking mode. The above refactor does not auto-reconnect/resubscribe at scale in our test harness, so I will continue to investigate.
Ok - this is now running pretty well in our chaos harness.
I've rolled in #326 and pushed 4.0.0b2 to PyPI. I'd be grateful if folks could try it out — looking for a final release next week. 👍
Following discussion on #317
On 4.0.0b1, the `test_groups_basic` test in either `test_pubsub.py` or `test_pubsub_sentinel.py` can hang intermittently. This is most pronounced in CI environments (GitHub Actions for this repo show some examples on PRs); locally, for me, it occurs roughly every 6-8 runs of the snippet below.

The hang occurs with a `RedisPubSubChannelLayer` when checking that a message is not received on some particular channel. This is a small test to more easily reproduce the issue, for `test_pubsub`:
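For illustration, a minimal test of that shape, assuming a `RedisPubSubChannelLayer` pointed at a local Redis (a sketch only; the exact snippet in the report may differ):

```python
import asyncio

import async_timeout
import pytest

from channels_redis.pubsub import RedisPubSubChannelLayer


@pytest.mark.asyncio
async def test_discarded_channel_gets_nothing():
    channel_layer = RedisPubSubChannelLayer(hosts=["redis://localhost:6379"])
    channel = await channel_layer.new_channel()
    await channel_layer.group_add("test-group", channel)
    await channel_layer.group_discard("test-group", channel)
    await channel_layer.group_send("test-group", {"type": "message.1"})
    # The discarded channel must never see the message; the intermittent
    # hang shows up in the cleanup that follows this timed-out receive.
    with pytest.raises(asyncio.TimeoutError):
        async with async_timeout.timeout(1):
            await channel_layer.receive(channel)
    await channel_layer.flush()
```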
Preliminary tracing found that `receive`, on attempting to `unsubscribe`, fails to ever return a connection from `_get_sub_conn`. A `_receive_task` appears to never return on multiple attempts, holding a lock indefinitely.

The following print annotations,
produce, on hang, an output of:
Successful runs have the last line swapped for `"receive_task cancelled"` and a clean exit.

Ideas so far from the above:

- `_receive_task` has here and here as the prime blocking candidates (a toy reduction of the hang is sketched below)
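A toy reduction of that blocking behaviour, with made-up names (the `wait_for` is only there so the demo terminates instead of hanging like the real test):

```python
import asyncio

async def stuck_receive_task():
    try:
        await asyncio.Event().wait()  # a read that never completes
    except asyncio.CancelledError:
        # Cancellation absorbed (e.g. misread as a timeout), so the
        # task keeps waiting instead of exiting.
        await asyncio.Event().wait()

async def main():
    lock = asyncio.Lock()
    task = asyncio.create_task(stuck_receive_task())
    await asyncio.sleep(0)  # let the task start waiting

    async def cleanup():
        async with lock:  # the unsubscribe path takes the lock...
            task.cancel()
            await task    # ...then waits forever; the lock is never
                          # released, so every later receive() or
                          # unsubscribe() call deadlocks behind it

    try:
        await asyncio.wait_for(cleanup(), timeout=2)
    except asyncio.TimeoutError:
        print("cleanup hung waiting on the receive task")

asyncio.run(main())
```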